Abstract

We study the fundamental limits to communication-efficient distributed methods for convex learning 
and optimization, under different assumptions on the information available to individual machines, and 
the types of functions considered. We identify cases where existing algorithms are already worst-case 
optimal, as well as cases where room for further improvement is still possible. Among other things, our 
results indicate that without similarity between the local objective functions (due to statistical data similarity
or otherwise) many communication rounds may be required, even if the machines have unbounded
computational power. 


1 Introduction 

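We consider the problem of distributed convex learning and optimization, where a set of $m$ machines, each
with access to a different local convex function $F_i : \mathbb{R}^d \to \mathbb{R}$ and a convex domain $\mathcal{W} \subseteq \mathbb{R}^d$, attempt to
solve the optimization problem

$$\min_{\mathbf{w} \in \mathcal{W}} F(\mathbf{w}) \quad \text{where} \quad F(\mathbf{w}) = \frac{1}{m} \sum_{i=1}^{m} F_i(\mathbf{w}). \tag{1}$$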

A prominent application is empirical risk minimization, where the goal is to minimize the average loss
over some dataset, and each machine has access to a different subset of the data. Letting $\{z_1, \ldots, z_N\}$
be the dataset composed of $N$ examples, and assuming the loss function $f(\mathbf{w}, z)$ is convex in $\mathbf{w}$, the
empirical risk minimization problem $\min_{\mathbf{w} \in \mathcal{W}} \frac{1}{N} \sum_{i=1}^{N} f(\mathbf{w}, z_i)$ can be written as in Eq. (1), where $F_i(\mathbf{w})$
is the average loss over machine $i$’s examples.
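As a small illustration of Eq. (1) in the empirical risk minimization setting, the following sketch splits a toy dataset evenly across machines and checks that averaging the local objectives reproduces the global average loss. The squared loss, the scalar parameter, the toy data and the helper names (`squared_loss`, `local_objective`, `global_objective`) are assumptions made for this example only.

```python
def squared_loss(w, z):
    """Convex loss f(w, z) for a scalar parameter w and an example z = (x, y)."""
    x, y = z
    return (w * x - y) ** 2


def local_objective(w, examples):
    """F_i(w): average loss over the examples held by one machine."""
    return sum(squared_loss(w, z) for z in examples) / len(examples)


def global_objective(w, shards):
    """F(w) = (1/m) * sum_i F_i(w), as in Eq. (1)."""
    return sum(local_objective(w, shard) for shard in shards) / len(shards)


# Toy dataset of N = 6 examples, split evenly across m = 3 machines.
dataset = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 8.0), (5.0, 9.0), (6.0, 13.0)]
m = 3
shards = [dataset[i::m] for i in range(m)]

w = 1.5
erm_value = sum(squared_loss(w, z) for z in dataset) / len(dataset)
# With equal-sized shards, the average of local averages equals the global average.
assert abs(global_objective(w, shards) - erm_value) < 1e-12
```

The identity relies on every machine holding the same number of examples; with unequal shards the local objectives would need to be weighted by shard size.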

The main challenge in solving such problems is that communication between the different machines 
is usually slow and constrained, at least compared to the speed of local processing. On the other hand, the 
datasets involved in distributed learning are usually large and high-dimensional. Therefore, machines cannot 
simply communicate their entire data to each other, and the question is how well can we solve problems such 
as Eq. (1) using as little communication as possible.

As datasets continue to increase in size, and parallel computing platforms become more and more
common (from multiple cores on a single CPU to large-scale and geographically distributed computing 
grids), distributed learning and optimization methods have been the focus of much research in recent years, 
with just a few examples including 1251 01121 UTl 01 |5l 03l |23l 06l 071 El |7l IH 011 |20l 09l El EH. Most 
of this work studied algorithms for this problem, which provide upper bounds on the required time and
communication complexity. 

In this paper, we take the opposite direction, and study the fundamental performance limitations
in solving Eq. (1), under several different sets of assumptions. We identify cases where existing algorithms
are already optimal (at least in the worst-case), as well as cases where room for further improvement is still 
possible. 

Since a major constraint in distributed learning is communication, we focus on studying the amount of
communication required to optimize Eq. (1) up to some desired accuracy $\epsilon$. More precisely, we consider the
number of communication rounds that are required, where in each communication round the machines can
generally broadcast to each other information linear in the problem’s dimension $d$ (e.g. a point in $\mathcal{W}$ or a
gradient). This applies to virtually all algorithms for large-scale learning we are aware of, where sending
vectors and gradients is feasible, but computing and sending larger objects, such as Hessians ($d \times d$ matrices),
is not.

Our results pertain to several possible settings (see Sec. 2 for precise definitions). First, we distinguish
between the local functions being merely convex or strongly-convex, and whether they are smooth or not.
These distinctions are standard in studying optimization algorithms for learning, and capture important
properties such as the regularization and the type of loss function used. Second, we distinguish between
a setting where the local functions are related - e.g., because they reflect statistical similarities in the data
residing at different machines - and a setting where no relationship is assumed. For example, in the extreme
case where data was split uniformly at random between machines, one can show that quantities such as the
values, gradients and Hessians of the local functions differ only by $\delta = O(1/\sqrt{n})$, where $n$ is the sample
size per machine, due to concentration of measure effects. Such similarities can be used to speed up the
optimization/learning process, as was done in e.g. [20, 26]. Both the $\delta$-related and the unrelated setting can
be considered in a unified way, by letting $\delta$ be a parameter and studying the attainable lower bounds as a
function of $\delta$. Our results can be summarized as follows:

• First, we define a mild structural assumption on the algorithm (which is satisfied by reasonable approaches
we are aware of), which allows us to provide the lower bounds described below on the number of
communication rounds required to reach a given suboptimality $\epsilon$.

- When the local functions can be unrelated, we prove a lower bound of $\Omega(\sqrt{1/\lambda}\,\log(1/\epsilon))$ for smooth
and $\lambda$-strongly convex functions, and $\Omega(\sqrt{1/\epsilon})$ for smooth convex functions. These lower bounds are
matched by a straightforward distributed implementation of accelerated gradient descent. In particular,
the results imply that many communication rounds may be required to get a high-accuracy solution,
and moreover, that no algorithm satisfying our structural assumption would be better, even if we
endow the local machines with unbounded computational power. For non-smooth functions, we show a
lower bound of $\Omega(\sqrt{1/(\lambda\epsilon)})$ for $\lambda$-strongly convex functions, and $\Omega(1/\epsilon)$ for general convex functions.
Although we leave a full derivation to future work, it seems these lower bounds can be matched in
our framework by an algorithm combining acceleration and Moreau proximal smoothing of the local
functions.

- When the local functions are related (as quantified by the parameter $\delta$), we prove a communication
round lower bound of $\Omega(\sqrt{\delta/\lambda}\,\log(1/\epsilon))$ for smooth and $\lambda$-strongly convex functions. For quadratics,
this bound is matched (up to constants and logarithmic factors) by the recently-proposed DISCO
algorithm [26]. However, getting an optimal algorithm for general strongly convex and smooth functions
in the $\delta$-related setting, let alone for non-smooth or non-strongly convex functions, remains open.

• We also study the attainable performance without posing any structural assumptions on the algorithm, but
in the more restricted case where only a single round of communication is allowed. We prove that in a
broad regime, the performance of any distributed algorithm may be no better than a ‘trivial’ algorithm
which returns the minimizer of one of the local functions, as long as the number of bits communicated is
less than $\Omega(d^2)$. Therefore, in our setting, no communication-efficient 1-round distributed algorithm can
provide non-trivial performance in the worst case.

Related Work 

There have been several previous works which considered lower bounds in the context of distributed learning
and optimization, but to the best of our knowledge, none of them provide a similar type of results. Perhaps
the most closely-related paper is [22], which studied the communication complexity of distributed optimization,
and showed that $\Omega(d \log(1/\epsilon))$ bits of communication are necessary between the machines, for
$d$-dimensional convex problems. However, in our setting this does not lead to any non-trivial lower bound
on the number of communication rounds (indeed, just specifying a $d$-dimensional vector up to accuracy $\epsilon$
requires $O(d \log(1/\epsilon))$ bits). More recently, lower bounds were considered for certain types of distributed
learning problems, but not convex ones in an agnostic distribution-free framework. In the context of lower
bounds for one-round algorithms, the results of [6] imply that $\Omega(d^2)$ bits of communication are required
to solve linear regression in one round of communication. However, that paper assumes a different model
than ours, where the function to be optimized is not split among the machines as in Eq. (1), where each
$F_i$ is convex. Moreover, issues such as strong convexity and smoothness are not considered. [20] proves
an impossibility result for a one-round distributed learning scheme, even when the local functions are not
merely related, but actually result from splitting data uniformly at random between machines. On the flip
side, that result is for a particular algorithm, and doesn’t apply to any possible method.

Finally, we emphasize that distributed learning and optimization can be studied under many settings, 
including ones different than those studied here. For example, one can consider distributed learning on a 
stream of i.i.d. data lIT^rTlfTOlISl. or settings where the computing architecture is different, e.g. where the 
machines have a shared memory, or the function to be optimized is not split as in Eq. (1). Studying lower
bounds in such settings is an interesting topic for future work. 

2 Notation and Framework 

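The only vector and matrix norms used in this paper are the Euclidean norm and the spectral norm,
respectively. $\mathbf{e}_j$ denotes the $j$-th standard unit vector. We let $\nabla G(\mathbf{w})$ and $\nabla^2 G(\mathbf{w})$ denote the gradient and
Hessian of a function $G$ at $\mathbf{w}$, if they exist. $G$ is smooth (with parameter $L$) if it is differentiable and the
gradient is $L$-Lipschitz. In particular, if $\mathbf{w}^* = \arg\min_{\mathbf{w} \in \mathcal{W}} G(\mathbf{w})$, then $G(\mathbf{w}) - G(\mathbf{w}^*) \le \frac{L}{2}\|\mathbf{w} - \mathbf{w}^*\|^2$. $G$
is strongly convex (with parameter $\lambda$) if for any $\mathbf{w}, \mathbf{w}' \in \mathcal{W}$, $G(\mathbf{w}') \ge G(\mathbf{w}) + \langle \mathbf{g}, \mathbf{w}' - \mathbf{w} \rangle + \frac{\lambda}{2}\|\mathbf{w}' - \mathbf{w}\|^2$,
where $\mathbf{g} \in \partial G(\mathbf{w})$ is a subgradient of $G$ at $\mathbf{w}$. In particular, if $\mathbf{w}^* = \arg\min_{\mathbf{w} \in \mathcal{W}} G(\mathbf{w})$, then
$G(\mathbf{w}) - G(\mathbf{w}^*) \ge \frac{\lambda}{2}\|\mathbf{w} - \mathbf{w}^*\|^2$. Any convex function is also strongly-convex with $\lambda = 0$. A special case of smooth
convex functions are quadratics, where $G(\mathbf{w}) = \mathbf{w}^\top A \mathbf{w} + \mathbf{b}^\top \mathbf{w} + c$ for some positive semidefinite matrix
$A$, vector $\mathbf{b}$ and scalar $c$. In this case, $\lambda$ and $L$ correspond to the smallest and largest eigenvalues of $A$.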

We model the distributed learning algorithm as an iterative process, where in each round the machines
may perform some local computations, followed by a communication round where each machine broadcasts
a message to all other machines. We make no assumptions on the computational complexity of the
local computations. After all communication rounds are completed, a designated machine provides the
algorithm’s output (possibly after additional local computation).

Clearly, without any assumptions on the number of bits communicated, the problem can be trivially
solved in one round of communication (e.g. each machine communicates the function $F_i$ to the designated
machine, which then solves Eq. (1)). However, in practical large-scale scenarios, this is infeasible, and the
size of each message (measured by the number of bits) is typically on the order of $\tilde{O}(d)$, enough to send
a $d$-dimensional real-valued vector*, such as points in the optimization domain or gradients, but not larger
objects such as $d \times d$ Hessians.
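To make the gap between the two message sizes concrete, here is a back-of-the-envelope sketch. It assumes each real number is sent at a fixed precision of 64 bits and that the Hessian is dense; both the precision and the function names are illustrative assumptions, not part of the model above.

```python
def bits_for_reals(num_reals, bits_per_real=64):
    """Bits needed to send `num_reals` numbers at a fixed (assumed) precision."""
    return num_reals * bits_per_real


def vector_message_bits(d, bits_per_real=64):
    """Size of a d-dimensional point or gradient."""
    return bits_for_reals(d, bits_per_real)


def hessian_message_bits(d, bits_per_real=64):
    """Size of a dense d x d Hessian."""
    return bits_for_reals(d * d, bits_per_real)


d = 1_000_000
vec_bytes = vector_message_bits(d) // 8     # 8 * 10**6 bytes, about 8 MB
hess_bytes = hessian_message_bits(d) // 8   # 8 * 10**12 bytes, about 8 TB
print(vec_bytes, hess_bytes, hess_bytes // vec_bytes)
```

The ratio between the two sizes is exactly $d$, which is why a per-round budget linear in $d$ rules out exchanging second-order information directly.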

In this model, our main question is the following: How many rounds of communication are necessary
in order to solve problems such as Eq. (1) to some given accuracy $\epsilon$?

As discussed in the introduction, we first need to distinguish between different assumptions on the 
possible relation between the local functions. One natural situation is when no significant relationship can 
be assumed, for instance when the data is arbitrarily split or is gathered by each machine from statistically 
dissimilar sources. We denote this as the unrelated setting. However, this assumption is often unnecessarily 
pessimistic. Often the data allocation process is more random, or we can assume that the different data 
sources for each machine have statistical similarities (to give a simple example, consider learning from 
users’ activity across a geographically distributed computing grid, each servicing its own local population). 
We will capture such similarities, in the context of quadratic functions, using the following definition: 

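Definition 1. We say that a set of quadratic functions

$$F_i(\mathbf{w}) := \mathbf{w}^\top A_i \mathbf{w} + \mathbf{b}_i^\top \mathbf{w} + c_i, \qquad A_i \in \mathbb{R}^{d \times d},\ \mathbf{b}_i \in \mathbb{R}^d,\ c_i \in \mathbb{R}$$

are $\delta$-related, if for any $i, j \in \{1, \ldots, k\}$, it holds that

$$\|A_i - A_j\| \le \delta, \qquad \|\mathbf{b}_i - \mathbf{b}_j\| \le \delta, \qquad |c_i - c_j| \le \delta.$$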
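Definition 1 can be checked numerically for a given set of quadratics. The sketch below is one possible way to do so in plain Python: it assumes each $A_i$ is symmetric and approximates the spectral norm of $A_i - A_j$ by power iteration from a fixed starting vector, so the result is an approximation from below rather than an exact norm. The function names and the tolerance `tol` are hypothetical choices for this example.

```python
import math


def mat_vec(B, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(b * x for b, x in zip(row, v)) for row in B]


def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))


def spectral_norm_sym(B, iters=500):
    """Approximate the spectral norm of a symmetric matrix B by power iteration."""
    d = len(B)
    v = [1.0 / (k + 1) for k in range(d)]  # fixed, generic starting vector
    scale = norm(v)
    v = [x / scale for x in v]
    est = 0.0
    for _ in range(iters):
        Bv = mat_vec(B, v)
        est = norm(Bv)  # ||Bv|| with ||v|| = 1 never exceeds the spectral norm
        if est == 0.0:
            return 0.0
        v = [x / est for x in Bv]
    return est


def are_delta_related(quadratics, delta, tol=1e-9):
    """Check Definition 1 for a list of (A, b, c) triples with symmetric A."""
    for i, (Ai, bi, ci) in enumerate(quadratics):
        for Aj, bj, cj in quadratics[i + 1:]:
            diff_A = [[x - y for x, y in zip(ri, rj)] for ri, rj in zip(Ai, Aj)]
            diff_b = [x - y for x, y in zip(bi, bj)]
            if spectral_norm_sym(diff_A) > delta + tol:
                return False
            if norm(diff_b) > delta + tol:
                return False
            if abs(ci - cj) > delta + tol:
                return False
    return True


f1 = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], 0.0)
f2 = ([[1.1, 0.0], [0.0, 0.9]], [0.05, 0.0], 0.02)
print(are_delta_related([f1, f2], delta=0.1), are_delta_related([f1, f2], delta=0.05))
```

For large $d$ one would normally use a numerical linear algebra library instead of hand-written power iteration; the pure-Python version is used here only to keep the example self-contained.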

For example, in the context of linear regression with the squared loss over a bounded subset of $\mathbb{R}^d$, and
assuming $mn$ data points with bounded norm are randomly and equally split among $m$ machines, it can be
shown that the conditions above hold with $\delta = O(1/\sqrt{n})$ [20]. The choice of $\delta$ provides us with a spectrum
of learning problems ranked by difficulty: When $\delta = \Omega(1)$, this generally corresponds to the unrelated
setting discussed earlier. When $\delta = O(1/\sqrt{n})$, we get the situation typical of randomly partitioned data.
When $\delta = 0$, all the local functions have essentially the same minimizers, in which case Eq. (1) can
be trivially solved with zero communication, just by letting one machine optimize its own local function.
We note that although Definition 1 can be generalized to non-quadratic functions, we do not need it for the
results presented here.

We end this section with an important remark. In this paper, we prove lower bounds for the $\delta$-related
setting, which includes as a special case the commonly-studied setting of randomly partitioned data (in
which case $\delta = O(1/\sqrt{n})$). However, our bounds do not apply for random partitioning, since they use
$\delta$-related constructions which do not correspond to randomly partitioned data. In fact, very recent work [12]
has cleverly shown that for randomly partitioned data, and for certain reasonable regimes of strong convexity
and smoothness, it is actually possible to get better performance than what is indicated by our lower bounds.
However, this encouraging result crucially relies on the random partition property, and in parameter regimes
which limit how much each data point needs to be “touched”, hence preserving key statistical independence
properties. We suspect that it may be difficult to improve on our lower bounds under substantially weaker
assumptions.

*The $\tilde{O}$ hides constants and factors logarithmic in the required accuracy of the solution. The idea is that we can represent real
numbers up to some arbitrarily high machine precision, enough so that finite-precision issues are not a problem.
3 Lower Bounds Using a Structural Assumption


In this section, we present lower bounds on the number of communication rounds, where we impose a certain 
mild structural assumption on the operations performed by the algorithm. Roughly speaking, our lower 
bounds pertain to a very large class of algorithms, which are based on linear operations involving points, 
gradients, and vector products with local Hessians and their inverses, as well as solving local optimization 
problems involving such quantities. At each communication round, the machines can share any of the 
vectors they have computed so far. Formally, we consider algorithms which satisfy the assumption stated 
below. For convenience, we state it for smooth functions (which are differentiable) and discuss the case of 
non-smooth functions in Sec. 3.2.

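Assumption 1. For each machine $j$, define a set $W_j \subset \mathbb{R}^d$, initially $W_j = \{\mathbf{0}\}$. Between communication
rounds, each machine $j$ iteratively computes and adds to $W_j$ some finite number of points $\mathbf{w}$, each satisfying

$$\gamma \mathbf{w} + \nu \nabla F_j(\mathbf{w}) \in \operatorname{span}\Big\{ \mathbf{w}',\ \nabla F_j(\mathbf{w}'),\ (\nabla^2 F_j(\mathbf{w}') + D)\mathbf{w}'',\ (\nabla^2 F_j(\mathbf{w}') + D)^{-1}\mathbf{w}'' \ \Big|\ \mathbf{w}', \mathbf{w}'' \in W_j,\ D \text{ diagonal},\ \nabla^2 F_j(\mathbf{w}') \text{ exists},\ (\nabla^2 F_j(\mathbf{w}') + D)^{-1} \text{ exists} \Big\} \tag{2}$$

for some $\gamma, \nu \ge 0$ such that $\gamma + \nu > 0$. After every communication round, let $W_j := \bigcup_{i=1}^{m} W_i$ for all $j$. The
algorithm’s final output (provided by the designated machine $j$) is a point in the span of $W_j$.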

This assumption requires several remarks: 

• Note that Wj is not an explicit part of the algorithm: It simply includes all points computed by machine 
j so far, or communicated to it by other machines, and is used to define the set of new points which the 
machine is allowed to compute. 

• The assumption bears some resemblance to - but is far weaker than - standard assumptions used to provide
lower bounds for iterative optimization algorithms. For example, a common assumption (see [14]) is that
each computed point $\mathbf{w}$ must lie in the span of the previous gradients. This corresponds to a special case
of Assumption 1 where $\gamma = 1$, $\nu = 0$, and the span is only over gradients of previously computed points.
Moreover, it also allows (for instance) exact optimization of each local function, which is a subroutine in
some distributed algorithms (e.g. [27, 25]), by setting $\gamma = 0$, $\nu = 1$ and computing a point $\mathbf{w}$ satisfying
$\gamma \mathbf{w} + \nu \nabla F_j(\mathbf{w}) = \mathbf{0}$. By allowing the span to include previous gradients, we also incorporate algorithms
which perform optimization of the local function plus terms involving previous gradients and points, such
as [20], as well as algorithms which rely on local Hessian information and preconditioning, such as [26].
In summary, the assumption is satisfied by most techniques for black-box convex optimization that we
are aware of. Finally, we emphasize that we do not restrict the number or computational complexity of
the operations performed between communication rounds.

• The requirement that $\gamma, \nu \ge 0$ is to exclude algorithms which solve non-convex local optimization problems
of the form $\min_{\mathbf{w}} F_j(\mathbf{w}) + \gamma \|\mathbf{w}\|^2$ with $\gamma < 0$, which are unreasonable in practice and can sometimes
break our lower bounds.

• The assumption that $W_j$ is initially $\{\mathbf{0}\}$ (namely, that the algorithm starts from the origin) is purely for
convenience, and our results can be easily adapted to any other starting point by shifting all functions
accordingly.
The techniques we employ in this section are inspired by lower bounds on the iteration complexity of
first-order methods for standard (non-distributed) optimization (see for example [14]). These are based on
the construction of ‘hard’ functions, where each gradient (or subgradient) computation can only provide a 
small improvement in the objective value. In our setting, the dynamics are roughly similar, but the necessity 
of many gradient computations is replaced by many communication rounds. This is achieved by constructing 
suitable local functions, where at any time point no individual machine can ‘progress’ on its own, without 
information from other machines. 


3.1 Smooth Local Functions 

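We begin by presenting a lower bound when the local functions $F_i$ are strongly-convex and smooth:

Theorem 1. For any even number $m$ of machines, any distributed algorithm which satisfies Assumption 1,
and for any $\lambda \in [0,1)$, $\delta \in (0,1)$, there exist $m$ local quadratic functions over $\mathbb{R}^d$ (where $d$ is sufficiently
large) which are 1-smooth, $\lambda$-strongly convex, and $\delta$-related, such that if $\mathbf{w}^* = \arg\min_{\mathbf{w} \in \mathbb{R}^d} F(\mathbf{w})$, then
the number of communication rounds required to obtain $\hat{\mathbf{w}}$ satisfying $F(\hat{\mathbf{w}}) - F(\mathbf{w}^*) \le \epsilon$ (for any $\epsilon > 0$)
is at least

$$\Omega\left(\sqrt{\frac{\delta}{\lambda}}\,\log\left(\frac{1}{\epsilon}\right)\right)$$

if $\lambda > 0$, and at least $\sqrt{c\,\delta/\epsilon}\,\|\mathbf{w}^*\| - 2$, for a numerical constant $c > 0$, if $\lambda = 0$.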

The assumption of $m$ being even is purely for technical convenience, and can be discarded at the cost of
making the proof slightly more complex. Also, note that $m$ does not appear explicitly in the bound, but may
appear implicitly, via $\delta$ (for example, in a statistical setting $\delta$ may depend on the number of data points per
machine, and may be larger if the same dataset is divided among more machines).

Let us contrast our lower bound with some existing algorithms and guarantees in the literature. First,
regardless of whether the local functions are similar or not, we can always simulate any gradient-based
method designed for a single machine, by iteratively computing gradients of the local functions, and performing
a communication round to compute their average. Clearly, this will be a gradient of the objective
function $F(\cdot) = \frac{1}{m} \sum_{i=1}^{m} F_i(\cdot)$, which can be fed into any gradient-based method such as gradient descent
or accelerated gradient descent [14]. The resulting number of required communication rounds is then equal
to the number of iterations. In particular, using accelerated gradient descent for smooth and $\lambda$-strongly
convex functions yields a round complexity of $O(\sqrt{1/\lambda}\,\log(\|\mathbf{w}^*\|^2/\epsilon))$, and $O(\|\mathbf{w}^*\|\sqrt{1/\epsilon})$ for smooth
convex functions. This matches our lower bound (up to constants and log factors) when the local functions
are unrelated ($\delta = \Omega(1)$).
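The simulation argument above, one communication round per gradient step, can be sketched as follows. For brevity the sketch uses plain (non-accelerated) gradient descent and diagonal quadratic local functions $F_i(\mathbf{w}) = \frac{1}{2}\sum_k h_{i,k}(w_k - c_{i,k})^2$; the function names, the toy data and the step size are assumptions made for this illustration.

```python
def local_gradient(w, h, c):
    """Gradient of the local quadratic F_i(w) = 0.5 * sum_k h[k] * (w[k] - c[k])**2."""
    return [hk * (wk - ck) for hk, wk, ck in zip(h, w, c)]


def distributed_gradient_descent(machines, step, num_rounds):
    """Gradient descent on F = average of the local F_i, one communication round per step."""
    d = len(machines[0][0])
    m = len(machines)
    w = [0.0] * d  # start from the origin
    rounds = 0
    for _ in range(num_rounds):
        # Local computation: each machine evaluates its own gradient at w.
        grads = [local_gradient(w, h, c) for h, c in machines]
        # Communication round: d-dimensional gradients are broadcast and averaged.
        avg = [sum(g[k] for g in grads) / m for k in range(d)]
        rounds += 1
        w = [wk - step * gk for wk, gk in zip(w, avg)]
    return w, rounds


# Two machines, each described by (h, c); the average F is 1-smooth and 0.4-strongly convex.
machines = [([1.0, 0.2], [1.0, 0.0]), ([1.0, 0.6], [0.0, 2.0])]
w, rounds = distributed_gradient_descent(machines, step=1.0, num_rounds=60)
print(w, rounds)  # w approaches the minimizer (0.5, 1.5) of the average objective
```

Every iteration costs exactly one round, so the round complexity equals the iteration complexity of whichever single-machine method is being simulated; swapping in an accelerated update would change only the last line of the loop.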

When the functions are related, however, the upper bounds above are highly sub-optimal: Even if the
local functions are completely identical, and $\delta = 0$, the number of communication rounds will remain
the same as when $\delta = \Omega(1)$. To utilize function similarity while guaranteeing arbitrarily small $\epsilon$, the two
most relevant algorithms are DANE [20], and the more recent DISCO [26]. For smooth and $\lambda$-strongly
convex functions, which are either quadratic or satisfy a certain self-concordance condition, DISCO achieves
$\tilde{O}(1 + \sqrt{\delta/\lambda})$ round complexity ([26, Thm. 2]), which matches our lower bound in terms of dependence on
$\delta, \lambda$. However, for non-quadratic losses, the round complexity bounds are somewhat worse, and there are no
guarantees for strongly convex and smooth functions which are not self-concordant. Thus, the question of
the optimal round complexity for such functions remains open.
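The full proof of Thm. 1 appears in the supplementary material, and is based on the following idea: For
simplicity, suppose we have two machines, with local functions $F_1, F_2$ defined as follows.

$$F_1(\mathbf{w}) = \frac{\delta(1-\lambda)}{4}\,\mathbf{w}^\top A_1 \mathbf{w} - \frac{\delta(1-\lambda)}{2}\,\mathbf{e}_1^\top \mathbf{w} + \frac{\lambda}{2}\|\mathbf{w}\|^2$$

$$F_2(\mathbf{w}) = \frac{\delta(1-\lambda)}{4}\,\mathbf{w}^\top A_2 \mathbf{w} + \frac{\lambda}{2}\|\mathbf{w}\|^2 \tag{3}$$

where

$$A_1 = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 & \cdots \\ 0 & 1 & -1 & 0 & 0 & 0 & \cdots \\ 0 & -1 & 1 & 0 & 0 & 0 & \cdots \\ 0 & 0 & 0 & 1 & -1 & 0 & \cdots \\ 0 & 0 & 0 & -1 & 1 & 0 & \cdots \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots \end{pmatrix}, \qquad A_2 = \begin{pmatrix} 1 & -1 & 0 & 0 & 0 & 0 & \cdots \\ -1 & 1 & 0 & 0 & 0 & 0 & \cdots \\ 0 & 0 & 1 & -1 & 0 & 0 & \cdots \\ 0 & 0 & -1 & 1 & 0 & 0 & \cdots \\ 0 & 0 & 0 & 0 & 1 & -1 & \cdots \\ 0 & 0 & 0 & 0 & -1 & 1 & \cdots \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots \end{pmatrix}$$
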

It is easy to verify that for $\delta, \lambda \le 1$, both $F_1(\mathbf{w})$ and $F_2(\mathbf{w})$ are 1-smooth and $\lambda$-strongly convex,
as well as $\delta$-related. Moreover, the optimum of their average is a point $\mathbf{w}^*$ with non-zero entries at all
coordinates. However, since each local function has a block-diagonal quadratic term, it can be shown that
for any algorithm satisfying Assumption 1, after $T$ communication rounds, the points computed by the two
machines can only have the first $T+1$ coordinates non-zero. No machine will be able to further ‘progress’ on
its own, and cause additional coordinates to become non-zero, without another communication round. This
leads to a lower bound on the optimization error which depends on $T$, resulting in the theorem statement
after a few computations.
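The coordinate-counting argument can be illustrated with a small support-tracking sketch. It is an abstraction rather than a proof: instead of running an actual algorithm, it records which coordinates a machine could make non-zero, assuming that multiplying by the local Hessian (or by an inverse of the Hessian plus a diagonal matrix) keeps a vector inside the diagonal blocks of $A_1$ or $A_2$, and that the linear term of $F_1$ contributes the direction $\mathbf{e}_1$. The function names are hypothetical.

```python
def block_of(coord, machine):
    """Coordinates (1-indexed) coupled to `coord` by the block-diagonal matrix A_machine."""
    if machine == 1:
        # A_1 has blocks {1}, {2, 3}, {4, 5}, ...
        if coord == 1:
            return {1}
        return {coord, coord + 1} if coord % 2 == 0 else {coord - 1, coord}
    # A_2 has blocks {1, 2}, {3, 4}, {5, 6}, ...
    return {coord, coord + 1} if coord % 2 == 1 else {coord - 1, coord}


def local_closure(support, machine, d):
    """Coordinates a machine can make non-zero on its own, without communicating."""
    reachable = set(support)
    if machine == 1:
        reachable.add(1)  # the linear term of F_1 contributes the direction e_1
    for coord in list(reachable):
        reachable |= block_of(coord, machine)
    return {c for c in reachable if c <= d}


def supports_after(num_rounds, d):
    """Reachable supports of both machines after `num_rounds` communication rounds."""
    supports = {j: local_closure(set(), j, d) for j in (1, 2)}
    for _ in range(num_rounds):
        shared = supports[1] | supports[2]  # W_j := union of all W_i
        supports = {j: local_closure(shared, j, d) for j in (1, 2)}
    return supports


for T in range(4):
    sizes = {j: len(s) for j, s in supports_after(T, d=20).items()}
    print(T, sizes)
```

In this abstraction the larger of the two supports grows by exactly one coordinate per round, alternating between the machines, which matches the claim that only the first $T+1$ coordinates can be non-zero after $T$ rounds.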


3.2 Non-smooth Local Functions 

Remaining in the framework of algorithms satisfying Assumption 1, we now turn to discuss the situation
where the local functions are not necessarily smooth or differentiable. For simplicity, our formal results here
will be in the unrelated setting, and we only informally discuss their extension to a $\delta$-related setting (in a
sense relevant to non-smooth functions). Formally defining $\delta$-related non-smooth functions is possible but
not altogether trivial, and is therefore left to future work.

We adapt Assumption 1 to the non-smooth case, by allowing gradients to be replaced by arbitrary subgradients
at the same points. Namely, we replace Eq. (2) by the requirement that for some $\mathbf{g} \in \partial F_j(\mathbf{w})$, and
$\gamma, \nu \ge 0$, $\gamma + \nu > 0$,

$$\gamma \mathbf{w} + \nu \mathbf{g} \in \operatorname{span}\Big\{ \mathbf{w}',\ \mathbf{g}',\ (\nabla^2 F_j(\mathbf{w}') + D)\mathbf{w}'',\ (\nabla^2 F_j(\mathbf{w}') + D)^{-1}\mathbf{w}'' \ \Big|\ \mathbf{w}', \mathbf{w}'' \in W_j,\ \mathbf{g}' \in \partial F_j(\mathbf{w}'),\ D \text{ diagonal},\ \nabla^2 F_j(\mathbf{w}') \text{ exists},\ (\nabla^2 F_j(\mathbf{w}') + D)^{-1} \text{ exists} \Big\}.$$

The lower bound for this setting is stated in the following theorem.

Theorem 2. For any even number $m$ of machines, any distributed optimization algorithm which satisfies
Assumption 1, and for any $\lambda \ge 0$, there exist $\lambda$-strongly convex, $(1+\lambda)$-Lipschitz continuous convex local
functions $F_1(\mathbf{w})$ and $F_2(\mathbf{w})$ over the unit Euclidean ball in $\mathbb{R}^d$ (where $d$ is sufficiently large), such that
if $\mathbf{w}^* = \arg\min_{\mathbf{w} : \|\mathbf{w}\| \le 1} F(\mathbf{w})$, the number of communication rounds required to obtain $\hat{\mathbf{w}}$ satisfying
$F(\hat{\mathbf{w}}) - F(\mathbf{w}^*) \le \epsilon$ (for any sufficiently small $\epsilon > 0$) is $\Omega(1/\epsilon)$ for $\lambda = 0$, and $\Omega(\sqrt{1/(\lambda\epsilon)})$ for $\lambda > 0$.
As in Thm. 1, we note that the assumption of even $m$ is for technical convenience.

This theorem, together with Thm. 1, implies that both strong convexity and smoothness are necessary for
the number of communication rounds to scale logarithmically with the required accuracy $\epsilon$. We emphasize
that this is true even if we allow the machines unbounded computational power, to perform arbitrarily many
operations satisfying Assumption 1. Moreover, a preliminary analysis indicates that performing accelerated
gradient descent on smoothed versions of the local functions (using Moreau proximal smoothing, e.g. [15,
24]), can match these lower bounds up to log factors^. We leave a full formal derivation (which has some
subtleties) to future work.
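As a concrete instance of Moreau proximal smoothing, consider the one-dimensional function $f(x) = |x|$, whose Moreau envelope with parameter $\gamma > 0$ is the well-known Huber function. The sketch below evaluates the envelope and its gradient; the choice of $|x|$ and the function names are assumptions made for illustration, and the sketch says nothing about the distributed algorithm itself.

```python
def moreau_envelope_abs(x, gamma):
    """Moreau envelope of f(x) = |x| with parameter gamma > 0 (the Huber function)."""
    if abs(x) <= gamma:
        return x * x / (2.0 * gamma)
    return abs(x) - gamma / 2.0


def moreau_gradient_abs(x, gamma):
    """Gradient of the envelope: x / gamma clipped to [-1, 1], hence (1/gamma)-Lipschitz."""
    return max(-1.0, min(1.0, x / gamma))


gamma = 0.5
for x in (-2.0, -0.25, 0.0, 0.25, 2.0):
    gap = abs(x) - moreau_envelope_abs(x, gamma)  # always between 0 and gamma / 2
    print(x, moreau_envelope_abs(x, gamma), moreau_gradient_abs(x, gamma), gap)
```

The envelope is differentiable everywhere, its gradient is $\frac{1}{\gamma}$-Lipschitz, and it stays within $\gamma/2$ of $|x|$, so a smaller $\gamma$ buys a better approximation at the price of a larger smoothness parameter.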

The full proof of Thm. 2 appears in the supplementary material. The proof idea relies on the following
construction: Assume that we fix the number of communication rounds to be $T$, and (for simplicity) that $T$
is even and the number of machines is 2. Then we use local functions of the form

$$F_1(\mathbf{w}) = \frac{1}{\sqrt{2}}\,|b - w_1| + \frac{1}{\sqrt{2(T+2)}}\big(|w_2 - w_3| + |w_4 - w_5| + \cdots + |w_T - w_{T+1}|\big) + \frac{\lambda}{2}\|\mathbf{w}\|^2$$

$$F_2(\mathbf{w}) = \frac{1}{\sqrt{2(T+2)}}\big(|w_1 - w_2| + |w_3 - w_4| + \cdots + |w_{T+1} - w_{T+2}|\big) + \frac{\lambda}{2}\|\mathbf{w}\|^2,$$

where $b$ is a suitably chosen parameter. It is easy to verify that both local functions are $\lambda$-strongly convex
and $(1+\lambda)$-Lipschitz continuous over the unit Euclidean ball. Similar to the smooth case, we argue that
after $T$ communication rounds, the resulting points $\mathbf{w}$ computed by machine 1 will be non-zero only on
the first $T+1$ coordinates, and the points $\mathbf{w}$ computed by machine 2 will be non-zero only on the first
$T$ coordinates. As in the smooth case, these functions allow us to ‘control’ the progress of any algorithm
which satisfies Assumption 1.

Finally, although the result is in the unrelated setting, it is straightforward to have a similar construction
in a ‘$\delta$-related’ setting, by multiplying $F_1$ and $F_2$ by $\delta$. The resulting two functions have their gradients
and subgradients at most $\delta$-different from each other, and the construction above leads to a lower bound
of $\Omega(\delta/\epsilon)$ for convex Lipschitz functions, and $\Omega(\delta\sqrt{1/(\lambda\epsilon)})$ for $\lambda$-strongly convex Lipschitz functions. In
terms of upper bounds, we are actually unaware of any relevant algorithm in the literature adapted to such a
setting, and the question of attainable performance here remains wide open.
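The scaling step behind these rates can be written out as a short derivation. It assumes the order-of-magnitude bounds of Theorem 2, namely $\Omega(1/\epsilon)$ rounds for convex and $\Omega(\sqrt{1/(\lambda_0\epsilon)})$ rounds for $\lambda_0$-strongly convex local functions, and uses $\lambda_0$ and $G_i$ as notation introduced only for this sketch.

```latex
% G_i := \delta F_i, where F_i is \lambda_0-strongly convex, so G_i is \lambda-strongly convex
% with \lambda = \delta \lambda_0. Write G for the average of the G_i.
\begin{align*}
G(\hat{\mathbf{w}}) - G(\mathbf{w}^*) \le \epsilon
  \quad &\Longleftrightarrow \quad
  F(\hat{\mathbf{w}}) - F(\mathbf{w}^*) \le \frac{\epsilon}{\delta}, \\
\text{convex case } (\lambda_0 = 0): \quad
  &\Omega\!\left(\frac{1}{\epsilon/\delta}\right) = \Omega\!\left(\frac{\delta}{\epsilon}\right), \\
\text{strongly convex case}: \quad
  &\Omega\!\left(\sqrt{\frac{1}{\lambda_0\,(\epsilon/\delta)}}\right)
   = \Omega\!\left(\sqrt{\frac{\delta^2}{\lambda\,\epsilon}}\right)
   = \Omega\!\left(\delta\sqrt{\frac{1}{\lambda\,\epsilon}}\right).
\end{align*}
```

In both cases the bound for the unrelated setting is multiplied by $\delta$, so the required number of rounds shrinks as the local functions become more similar.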


4 One Round of Communication 

In this section, we study what lower bounds are attainable without any kind of structural assumption (such
as Assumption 1). This is a more challenging setting, and the result we present will be limited to algorithms
using a single round of communication. We note that this still captures a realistic non-interactive distributed
computing scenario, where we want each machine to broadcast a single message, and a designated
machine is then required to produce an output. In the context of distributed optimization, a natural example
is a one-shot averaging algorithm, where each machine optimizes its own local data, and the resulting points
are averaged (e.g. [27, 25]).

^Roughly speaking, for any $\gamma > 0$, this smoothing creates a $\frac{1}{\gamma}$-smooth function which is $\gamma$-close to the original function.
Plugging these into the guarantees of accelerated gradient descent and tuning $\gamma$ yields our lower bounds. Note that, in order
to execute this algorithm, each machine must be sufficiently powerful to obtain the gradient of the Moreau envelope of its local
function, which is indeed the case in our framework.

Intuitively, with only a single round of communication, getting an arbitrarily small error $\epsilon$ may be infeasible.
The following theorem establishes a lower bound on the attainable error, depending on the strong
convexity parameter $\lambda$ and the similarity measure $\delta$ between the local functions, and compares this with a
‘trivial’ zero-communication algorithm, which just returns the optimum of a single local function:

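Theorem 3. For any even number $m$ of machines, any dimension $d$ larger than some numerical constant,
any $\delta \ge 3\lambda > 0$, and any (possibly randomized) algorithm which communicates at most $d^2/128$ bits in a
single round of communication, there exist $m$ quadratic functions over $\mathbb{R}^d$, which are $\delta$-related, $\lambda$-strongly
convex and $9\lambda$-smooth, for which the following hold for some positive numerical constants $c, c'$:

• The point $\hat{\mathbf{w}}$ returned by the algorithm satisfies

$$\mathbb{E}\left[F(\hat{\mathbf{w}}) - \min_{\mathbf{w} \in \mathbb{R}^d} F(\mathbf{w})\right] \ge c\,\frac{\delta^2}{\lambda}$$

in expectation over the algorithm’s randomness.

• For any machine $j$, if $\hat{\mathbf{w}}_j = \arg\min_{\mathbf{w} \in \mathbb{R}^d} F_j(\mathbf{w})$, then $F(\hat{\mathbf{w}}_j) - \min_{\mathbf{w} \in \mathbb{R}^d} F(\mathbf{w}) \le c'\,\delta^2/\lambda$.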


The theorem shows that unless the communication budget is extremely large (quadratic in the dimension),
there are functions which cannot be optimized to non-trivial accuracy in one round of communication,
in the sense that the same accuracy (up to a universal constant) can be obtained with a ‘trivial’ solution where
we just return the optimum of a single local function. This complements an earlier result in [20], which
showed that a particular one-round algorithm is no better than returning the optimum of a local function,
under the stronger assumption that the local functions are not merely $\delta$-related, but are actually the average
loss over some randomly partitioned data.

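The full proof appears in the supplementary material, but we sketch the main ideas below. As before,
focusing on the case of two machines, and assuming machine 2 is responsible for providing the output, we
use

$$F_1(\mathbf{w}) = 3\lambda\,\mathbf{w}^\top \left( \left(I + \frac{1}{2c\sqrt{d}} M\right)^{-1} - \frac{1}{2} I \right) \mathbf{w}$$

$$F_2(\mathbf{w}) = \frac{3\lambda}{2}\|\mathbf{w}\|^2 - \delta\,\mathbf{e}_j^\top \mathbf{w},$$

where $M$ is essentially a randomly chosen $\{-1,+1\}$-valued $d \times d$ symmetric matrix with spectral norm at
most $c\sqrt{d}$, and $c$ is a suitable constant. These functions can be shown to be $\delta$-related as well as $\lambda$-strongly
convex. Moreover, the optimum of $F(\mathbf{w}) = \frac{1}{2}(F_1(\mathbf{w}) + F_2(\mathbf{w}))$ equals

$$\mathbf{w}^* = \frac{\delta}{6\lambda} \left( I + \frac{1}{2c\sqrt{d}} M \right) \mathbf{e}_j.$$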

Thus, we see that the optimal point $\mathbf{w}^*$ depends on the $j$-th column of $M$. Intuitively, the machines need to
approximate this column, and this is the source of hardness in this setting: Machine 1 knows $M$ but not $j$,
yet needs to communicate to machine 2 enough information to construct its $j$-th column. However, given
a communication budget much smaller than the size of $M$ (which is $d^2$), it is difficult to convey enough
information on the $j$-th column without knowing what $j$ is. Carefully formalizing this intuition, and using
some information-theoretic tools, allows us to prove the first part of Thm. 3. Proving the second part of
Thm. 3 is straightforward, using a few computations.
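The expression for the optimum can be checked with a short calculation. The sketch below assumes the two-machine construction takes the form $F_1(\mathbf{w}) = 3\lambda\,\mathbf{w}^\top(B^{-1} - \frac{1}{2}I)\mathbf{w}$ and $F_2(\mathbf{w}) = \frac{3\lambda}{2}\|\mathbf{w}\|^2 - \delta\,\mathbf{e}_j^\top\mathbf{w}$, with the shorthand $B = I + \frac{1}{2c\sqrt{d}}M$ introduced here only for readability.

```latex
\begin{align*}
F(\mathbf{w}) &= \tfrac{1}{2}\big(F_1(\mathbf{w}) + F_2(\mathbf{w})\big)
   = \frac{3\lambda}{2}\,\mathbf{w}^\top B^{-1}\mathbf{w} - \frac{\delta}{2}\,\mathbf{e}_j^\top\mathbf{w}, \\
\nabla F(\mathbf{w}) &= 3\lambda\,B^{-1}\mathbf{w} - \frac{\delta}{2}\,\mathbf{e}_j = \mathbf{0}
   \quad\Longrightarrow\quad
   \mathbf{w}^* = \frac{\delta}{6\lambda}\,B\,\mathbf{e}_j . \\
% Curvature check: \|M\| \le c\sqrt{d} gives eigenvalues of B in [1/2, 3/2],
% hence eigenvalues of B^{-1} in [2/3, 2].
\nabla^2 F_1 &= 6\lambda\big(B^{-1} - \tfrac{1}{2}I\big)
   \ \text{ has eigenvalues in } \ \big[6\lambda(\tfrac{2}{3} - \tfrac{1}{2}),\ 6\lambda(2 - \tfrac{1}{2})\big] = [\lambda,\ 9\lambda].
\end{align*}
```

Under these assumptions $\mathbf{w}^*$ is a scaled copy of the $j$-th column of $B$, and the curvature range $[\lambda, 9\lambda]$ agrees with the $\lambda$-strong convexity and $9\lambda$-smoothness stated in Theorem 3.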
5 Summary and Open Questions


In this paper, we studied lower bounds on the number of communication rounds needed to solve distributed 
convex learning and optimization problems, under several different settings. Our results indicate that when 
the local functions are unrelated, then regardless of the local machines’ computational power, many communication
rounds may be necessary (scaling polynomially with $1/\epsilon$ or $1/\lambda$), and that the worst-case optimal
algorithm (at least for smooth functions) is just a straightforward distributed implementation of accelerated
gradient descent. When the functions are related, we show that the optimal performance is achieved by the
algorithm of [26] for quadratic and strongly convex functions, but designing optimal algorithms for more
general functions remains open. Beside these results, which required a certain mild structural assumption on 
the algorithm employed, we also provided an assumption-free lower bound for one-round algorithms, which 
implies that even for strongly convex quadratic functions, such algorithms can sometimes only provide 
trivial performance. 

Besides the question of designing optimal algorithms for the remaining settings, several additional questions remain open. First, it would be interesting to get assumption-free lower bounds for algorithms with multiple rounds of communication. Second, our work focused on communication complexity, but in practice the computational complexity of the local computations is no less important. Thus, it would be interesting to understand what performance is attainable with simple, runtime-efficient algorithms. Finally, it would be interesting to study lower bounds for other distributed learning and optimization scenarios.
A Proofs


A.1 Proof of Thm. 1

The proof of the theorem is based on splitting the machines into two sub-groups of the same size, each of which is assigned a finite-dimensional restriction of $F_1$ or $F_2$ (see Eq. (3)), and tracing the maximal number of non-zero coordinates for vectors in $W_j$, the set of feasible points.


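Recall that $F_1, F_2$ are defined as follows:

\[
F_1(w) = \frac{\delta(1-\lambda)}{4} w^\top A_1 w - \frac{\delta(1-\lambda)}{2} e_1^\top w + \frac{\lambda}{2}\|w\|^2,
\qquad
F_2(w) = \frac{\delta(1-\lambda)}{4} w^\top A_2 w + \frac{\lambda}{2}\|w\|^2,
\]

where

\[
A_1 = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & \cdots \\
0 & 1 & -1 & 0 & 0 & \cdots \\
0 & -1 & 1 & 0 & 0 & \cdots \\
0 & 0 & 0 & 1 & -1 & \cdots \\
0 & 0 & 0 & -1 & 1 & \cdots \\
\vdots & \vdots & \vdots & \vdots & \vdots & \ddots
\end{pmatrix},
\qquad
A_2 = \begin{pmatrix}
1 & -1 & 0 & 0 & 0 & 0 & \cdots \\
-1 & 1 & 0 & 0 & 0 & 0 & \cdots \\
0 & 0 & 1 & -1 & 0 & 0 & \cdots \\
0 & 0 & -1 & 1 & 0 & 0 & \cdots \\
0 & 0 & 0 & 0 & 1 & -1 & \cdots \\
0 & 0 & 0 & 0 & -1 & 1 & \cdots \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots
\end{pmatrix}.
\]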

Formally speaking, we consider the matrices $A_1, A_2$ as infinite in size, so that each $F_i$ is defined over $\ell^2(\mathbb{R})$, the space of square-summable sequences. To derive lower bounds in $\mathbb{R}^d$, we consider the following restrictions of $F_i$ and $F$:

\[
[F_i]_d(w) := F_i(w_1, w_2, \ldots, w_d, 0, 0, \ldots), \quad w \in \mathbb{R}^d,
\qquad
[F]_d(w) := \frac{[F_1]_d(w) + [F_2]_d(w)}{2}.
\]

Note that $[F_i]_d(w)$ and $[F]_d(w)$ produce the same values as $F_i(w)$ and $F(w)$ do for vectors such that $w_i = 0$ for all $i > d$. Similarly, we define the $d \times d$ leading principal submatrix of $A_i$ by $[A_i]_d$.

We assign half of the machines with $[F_1]_d$, and the other half with $[F_2]_d$. To prove the theorem, we need the following lemma, which formalizes the intuition described in the main paper. Let

\[
E_{0,d} = \{0\}, \qquad E_{t,d} = \operatorname{span}\{e_{1,d}, \ldots, e_{t,d}\},
\]

where $e_{i,d} \in \mathbb{R}^d$ denote the standard unit vectors. Then, the following holds:

Lemma 1. Suppose all the sets of feasible points satisfy $W_j \subseteq E_{T,d}$ for some $T \le d-1$. Then, under Assumption 1, right after the next communication round we have $W_j \subseteq E_{T+1,d}$.

Proof. Recall that by Assumption 1, each machine can compute new points $w$ that satisfy the following for some $\gamma, \nu \ge 0$ such that $\gamma + \nu > 0$:

\[
\gamma w + \nu \nabla [F_i]_d(w) \in \operatorname{span}\Big\{ w',\ \nabla [F_i]_d(w'),\ (\nabla^2 [F_i]_d(w') + D) w'',\ (\nabla^2 [F_i]_d(w') + D)^{-1} w'' \ \Big|\
w', w'' \in W_j,\ D \text{ diagonal},\ \nabla^2 [F_i]_d(w') \text{ exists},\ (\nabla^2 [F_i]_d(w') + D)^{-1} \text{ exists} \Big\}.
\]
We now analyze the state of the sets of feasible points prior to the next communication round. Assume that $T$ is an odd number, i.e., assume $T = 2k+1$ for some $k = 0, 1, \ldots$. The proof for the case where $T$ is even follows similar lines. Note that for any $w', w'' \in W_j$, we have

\begin{align*}
\nabla [F_1]_d(w') &= \frac{\delta(1-\lambda)}{2} [A_1]_d w' - \frac{\delta(1-\lambda)}{2} e_1 + \lambda w' \in E_{2k+1,d}, \\
(\nabla^2 [F_1]_d(w') + D) w'' &= \left( \frac{\delta(1-\lambda)}{2} [A_1]_d + D + \lambda I \right) w'' \in E_{2k+1,d}, \\
(\nabla^2 [F_1]_d(w') + D)^{-1} w'' &= \left( \frac{\delta(1-\lambda)}{2} [A_1]_d + D + \lambda I \right)^{-1} w'' \in E_{2k+1,d},
\end{align*}

for any viable diagonal matrix $D$. Therefore, since $W_j \subseteq E_{2k+1,d}$, we have that the first point generated by machines which hold $[F_1]_d$ must satisfy

\[
\gamma w + \nu \nabla [F_1]_d(w) \in E_{2k+1,d}
\]

for $\gamma, \nu$ as stated in the assumption. That is,

\[
\frac{\nu\delta(1-\lambda)}{2} [A_1]_d w + (\gamma + \nu\lambda) w - \frac{\nu\delta(1-\lambda)}{2} e_1 \in E_{2k+1,d},
\]

which implies

\[
\underbrace{\left( \frac{\nu\delta(1-\lambda)}{2} [A_1]_d + (\gamma + \nu\lambda) I \right)}_{H} w \in E_{2k+1,d}.
\]

Since $[A_1]_d$ is positive semidefinite, it holds that $H$ is invertible. Also, $[A_1]_d$, $H$ and $H^{-1}$ admit the same partitions into $1 \times 1$ and $2 \times 2$ blocks on the diagonal, thus $H^{-1} E_{2k+1,d} \subseteq E_{2k+1,d}$, yielding $w \in E_{2k+1,d}$. Inductively extending the latter argument shows that, in the absence of any communication rounds, all the machines whose local function is $[F_1]_d$ are 'stuck' in $E_{2k+1,d}$.

As for machines which hold $[F_2]_d$, we have that for all $w', w'' \in W_j$

\begin{align*}
\nabla [F_2]_d(w') &= \frac{\delta(1-\lambda)}{2} [A_2]_d w' + \lambda w' \in E_{2k+2,d}, \\
(\nabla^2 [F_2]_d(w') + D) w'' &= \left( \frac{\delta(1-\lambda)}{2} [A_2]_d + D + \lambda I \right) w'' \in E_{2k+2,d}, \\
(\nabla^2 [F_2]_d(w') + D)^{-1} w'' &= \left( \frac{\delta(1-\lambda)}{2} [A_2]_d + D + \lambda I \right)^{-1} w'' \in E_{2k+2,d},
\end{align*}

for any viable diagonal matrix $D$. Therefore, the first generated point by these machines must satisfy

\[
\gamma w + \nu \nabla [F_2]_d(w) \in E_{2k+2,d}
\]

for appropriate $\gamma, \nu$. Hence,

\[
\left( \frac{\nu\delta(1-\lambda)}{2} [A_2]_d + (\gamma + \nu\lambda) I \right) w \in E_{2k+2,d}.
\]

Similarly to the previous case, this implies that $w \in E_{2k+2,d}$. It is now left to show that these machines cannot make further progress beyond $E_{2k+2,d}$ without communicating. To see this, note that for all $w', w'' \in E_{2k+2,d}$ we have

\begin{align*}
\nabla [F_2]_d(w') &= \frac{\delta(1-\lambda)}{2} [A_2]_d w' + \lambda w' \in E_{2k+2,d}, \\
(\nabla^2 [F_2]_d(w') + D) w'' &= \left( \frac{\delta(1-\lambda)}{2} [A_2]_d + D + \lambda I \right) w'' \in E_{2k+2,d}, \\
(\nabla^2 [F_2]_d(w') + D)^{-1} w'' &= \left( \frac{\delta(1-\lambda)}{2} [A_2]_d + D + \lambda I \right)^{-1} w'' \in E_{2k+2,d}.
\end{align*}

This means that all the points which are generated subsequently also lie in $E_{2k+2,d}$, i.e., without communicating, machines whose local function is $[F_2]_d$ are stuck in $E_{2k+2,d}$. Finally, executing a communication round updates all the sets of feasible points to be $W_j := E_{2k+2,d}$. □

The following is a direct consequence of a recursive application of Lemma 1.

Corollary 1. Under Assumption 1, after $T \le d-1$ communication rounds we have

\[
W_j \subseteq E_{T+1,d}, \qquad j \in \{1, \ldots, m\}.
\]
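The block structure behind Lemma 1 can be illustrated numerically. The following Python sketch is only an illustration under stated assumptions: it builds the leading principal submatrices of $A_1$ and $A_2$ (ones on the diagonal, $-1$ entries coupling coordinates $(2,3),(4,5),\ldots$ and $(1,2),(3,4),\ldots$ respectively) and tracks the last non-zero coordinate under matrix-vector products only, which is one of the operations permitted by Assumption 1. The function names are hypothetical.

```python
def restricted_matrix(d, which):
    """d x d leading principal submatrix of A_1 (which=1) or A_2 (which=2)."""
    A = [[1 if r == s else 0 for s in range(d)] for r in range(d)]
    start = 1 if which == 1 else 0  # 0-based index of the first coupled pair
    for r in range(start, d - 1, 2):
        A[r][r + 1] = A[r + 1][r] = -1
    return A

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def support_size(v):
    """Largest 1-based index of a non-zero coordinate (0 for the zero vector)."""
    return max((i + 1 for i, x in enumerate(v) if x != 0), default=0)

d = 10
A1, A2 = restricted_matrix(d, 1), restricted_matrix(d, 2)
v = [1] + [0] * (d - 1)              # start from e_1, the linear term of F_1
supports = [support_size(v)]
for rnd in range(1, 6):
    A = A2 if rnd % 2 == 1 else A1   # alternate which type of machine can make progress
    v = matvec(A, v)
    # applying the same matrix again does not reach a new coordinate
    assert support_size(matvec(A, v)) == support_size(v)
    supports.append(support_size(v))
```

Each alternation between the two matrices, which corresponds to a communication round between the two groups of machines, extends the support by exactly one coordinate in this run.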


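With this corollary in hand, we now turn to prove the main result. First, we compute the minimizer of the average function $F(w) = \frac{1}{2} F_1(w) + \frac{1}{2} F_2(w)$, denoted by $w^*$, whose form for an even number of machines is simply

\[
F(w) = \frac{\delta(1-\lambda)}{8} w^\top (A_1 + A_2) w - \frac{\delta(1-\lambda)}{4} e_1^\top w + \frac{\lambda}{2} \|w\|^2 .
\]

By the first-order optimality condition for smooth convex functions, we have

\[
\left( \frac{\delta(1-\lambda)}{4} (A_1 + A_2) + \lambda I \right) w^* - \frac{\delta(1-\lambda)}{4} e_1 = 0,
\]

or equivalently,

\[
\left( A_1 + A_2 + \frac{4\lambda}{\delta(1-\lambda)} I \right) w^* = e_1,
\]

whose coordinate form is as follows:

\begin{align}
\left( 2 + \frac{4\lambda}{\delta(1-\lambda)} \right) w^*[1] - w^*[2] &= 1, \tag{4} \\
w^*[k+1] - \left( 2 + \frac{4\lambda}{\delta(1-\lambda)} \right) w^*[k] + w^*[k-1] &= 0, \quad k \ge 2. \tag{5}
\end{align}

The optimal solution can now be realized as a geometric sequence $(\zeta^k)_{k=1}^{\infty}$ for some $\zeta$. By Eq. (5), we must have

\[
\zeta^2 - \left( 2 + \frac{4\lambda}{\delta(1-\lambda)} \right) \zeta + 1 = 0,
\]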
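with the smallest root being

\[
\zeta = \frac{2 + \frac{4\lambda}{\delta(1-\lambda)} - \sqrt{\left( 2 + \frac{4\lambda}{\delta(1-\lambda)} \right)^2 - 4}}{2}
= 1 + \frac{2\lambda}{\delta(1-\lambda)} - \sqrt{\left( 1 + \frac{2\lambda}{\delta(1-\lambda)} \right)^2 - 1}. \tag{6}
\]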


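Therefore, this choice of $\zeta$ satisfies Eq. (5), and it is straightforward to verify that it also satisfies Eq. (4), hence $w^*$ indeed equals $(\zeta^k)_{k=1}^{\infty}$. It will be convenient to denote a continuous range of coordinates of $(\zeta^k)_{k=1}^{\infty}$ by $\zeta_{a:b}$, where $a \in \mathbb{N}$ and $b \in \mathbb{N} \cup \{\infty\}$. Also, using the following inequality, which holds for $x > 1$,

\[
x - \sqrt{x^2 - 1} \ge \exp\left( \frac{-2}{\sqrt{\frac{x+1}{x-1}} - 1} \right),
\]

together with Eq. (6) yields

\[
\zeta \ge \exp\left( \frac{-2}{\sqrt{\delta(1/\lambda - 1) + 1} - 1} \right). \tag{7}
\]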
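As a sanity check on the geometric-sequence solution, the short Python sketch below evaluates the smallest root $\zeta$, the residuals of the coordinate equations for $w^*[k] = \zeta^k$, and the exponential lower bound on $\zeta$. It is an illustrative numerical check rather than part of the proof; the function names and the sample $(\delta, \lambda)$ pairs are assumptions.

```python
import math

def zeta(delta, lam):
    """Smallest root of z^2 - (2 + 4*lam/(delta*(1-lam)))*z + 1 = 0."""
    x = 1.0 + 2.0 * lam / (delta * (1.0 - lam))
    return x - math.sqrt(x * x - 1.0)

def zeta_lower_bound(delta, lam):
    """exp(-2 / (sqrt(delta*(1/lam - 1) + 1) - 1)), the bound obtained from the inequality."""
    return math.exp(-2.0 / (math.sqrt(delta * (1.0 / lam - 1.0) + 1.0) - 1.0))

def recurrence_residuals(delta, lam, k_max=20):
    """Residuals of the coordinate equations when w*[k] = zeta**k."""
    z = zeta(delta, lam)
    a = 2.0 + 4.0 * lam / (delta * (1.0 - lam))
    first = a * z - z ** 2 - 1.0                      # first coordinate equation
    rest = [z ** (k + 1) - a * z ** k + z ** (k - 1)  # recurrence for k >= 2
            for k in range(2, k_max)]
    return [first] + rest

params = [(1.0, 0.1), (0.5, 0.01), (0.05, 0.3)]  # illustrative (delta, lam) pairs
checks = [(zeta(dl, lm), zeta_lower_bound(dl, lm)) for dl, lm in params]
worst_residual = max(abs(r) for dl, lm in params for r in recurrence_residuals(dl, lm))
```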

We now use this computation (with respect to $F_1, F_2$) to find the minimizer of $[F]_d$, defined as the average function of the finite-dimensional restrictions $[F_1]_d, [F_2]_d$ actually handed to the machines. Fix $d \in \mathbb{N}$ and denote the corresponding minimizer by

\[
w_d^* = \arg\min_{w \in \mathbb{R}^d} [F]_d(w).
\]

Let $w_T$ be some point which was obtained after $T \le d-2$ communication rounds. To bound the sub-optimality of $w_T$ from below, observe that

\begin{align*}
[F]_d(w_T) - [F]_d(w_d^*) &\ge [F]_d(w_T) - [F]_d(\zeta_{1:d-1}) \\
&= [F]_d(w_T) - F(w^*) + F(w^*) - [F]_d(\zeta_{1:d-1}) \\
&= \underbrace{F(w_T) - F(w^*)}_{A} + \underbrace{F(w^*) - F(\zeta_{1:d-1})}_{B},
\end{align*}

where the last equality follows from Corollary 1, according to which all the coordinates of $w_T$, except for the first $T+1 \le d-1$, must vanish. To bound the $A$ term, note that

\[
\|w_T - w^*\|^2 \ge \sum_{t=T+2}^{\infty} \zeta^{2t} = \zeta^{2(T+1)} \sum_{t=1}^{\infty} \zeta^{2t} = \zeta^{2(T+1)} \|w^*\|^2 .
\]

The fact that $F(w)$ is $\lambda$-strongly convex implies

\[
F(w_T) - F(w^*) \ge \frac{\lambda}{2} \|w_T - w^*\|^2 \ge \frac{\lambda}{2} \zeta^{2(T+1)} \|w^*\|^2 .
\]

Inequality (7) yields

\[
F(w_T) - F(w^*) \ge \frac{\lambda}{2} \exp\left( \frac{-4T-4}{\sqrt{\delta(1/\lambda - 1) + 1} - 1} \right) \|w^*\|^2 .
\]
To bound the $B$ term from below, note that since $F$ is 1-smooth we have

\[
F(w^*) - F(\zeta_{1:d-1}) \ge -\frac{1}{2} \|w^* - \zeta_{1:d-1}\|^2 = -\frac{1}{2} \sum_{t=d}^{\infty} \zeta^{2t} = -\frac{\zeta^{2(d-1)}}{2} \|w^*\|^2 .
\]

Combining both lower bounds for the terms $A$ and $B$, we get for any $T \le d-2$

\[
[F]_d(w_T) - [F]_d(w_d^*) \ge \left( \frac{\lambda}{2} \exp\left( \frac{-4T-4}{\sqrt{\delta(1/\lambda - 1) + 1} - 1} \right) - \frac{\zeta^{2(d-1)}}{2} \right) \|w^*\|^2 .
\]

Picking $d$ sufficiently large, and considering how large the number of communication rounds $T$ must be to make this lower bound less than $\epsilon$, we get


T > 


\/ <^(1/A — 1) + 1 — 1 
4 



- 1 . 
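To see how a bound on $T$ of this shape arises, the Python sketch below inverts only the $A$-term bound $\frac{\lambda}{2}\exp\big(\frac{-4T-4}{\sqrt{\delta(1/\lambda-1)+1}-1}\big)\|w^*\|^2$, under the assumption that $d$ is large enough for the $\zeta^{2(d-1)}$ term to be negligible. The resulting constant inside the logarithm is a consequence of that simplification and may differ from the one in the theorem statement; all names and numeric values are illustrative.

```python
import math

def gap_lower_bound(T, delta, lam, w_norm_sq):
    """(lam/2) * exp(-(4T+4) / (sqrt(delta*(1/lam-1)+1) - 1)) * ||w*||^2."""
    q = math.sqrt(delta * (1.0 / lam - 1.0) + 1.0)
    return 0.5 * lam * math.exp(-(4.0 * T + 4.0) / (q - 1.0)) * w_norm_sq

def rounds_threshold(eps, delta, lam, w_norm_sq):
    """Real-valued T at which the bound above equals eps (the B term is dropped)."""
    q = math.sqrt(delta * (1.0 / lam - 1.0) + 1.0)
    return (q - 1.0) / 4.0 * math.log(lam * w_norm_sq / (2.0 * eps)) - 1.0

delta, lam, w_norm_sq, eps = 1.0, 1e-4, 1.0, 1e-9  # illustrative values only
T0 = rounds_threshold(eps, delta, lam, w_norm_sq)
gap_at_T0 = gap_lower_bound(T0, delta, lam, w_norm_sq)
gap_before = gap_lower_bound(T0 - 1.0, delta, lam, w_norm_sq)
```

For fewer than `T0` rounds the lower bound on the suboptimality still exceeds `eps`, which is the sense in which `T0` rounds are necessary in this simplified calculation.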


It is worth mentioning that by computing the exact minimizers of $[F]_d$ one may derive a lower bound such that the choice of $d$ does not depend on the parameters of the problem, except for the number of communication rounds. Nevertheless, such analysis requires a more involved reasoning which we find unnecessary for stating our results.

For the non-strongly convex case, where A = 0, using Corollary [T] and a similar analysis (virtually 
identical to the proof of Theorem 2.1.7 in ifT^ l. we have that if T < — 1), then 


F(w'r) — F(w*) > 


32(r + 2)2 


Therefore, to obtain an e-suboptimal solution for this case, we must have at least 



-2 


communication rounds, for sufficiently small e. 

A.2 Proof of Thm. 2

We construct two types of local functions, and provide one of them to $m/2$ of the machines, and the other function to the other $m/2$ machines, in some arbitrary order. In this case, the average function is simply the average of the two types of local functions.

We will first prove the theorem statement in the strongly convex case, where $\lambda > 0$ is given, and then explain how to extract from it the result in the non-strongly convex case.

Fix a natural number $k$ and some $b \in [0, 1/\sqrt{k}]$, to be specified later. We define the following local functions over the unit ball:

\begin{align}
F_{1,k}(w) &= \frac{1}{\sqrt{2}} |b - w[1]| + \frac{1}{\sqrt{2k}} \big( |w[2] - w[3]| + |w[4] - w[5]| + \cdots + |w[k-2] - w[k-1]| \big) + \frac{\lambda}{2} \|w\|^2, \notag \\
F_{2,k}(w) &= \frac{1}{\sqrt{2k}} \big( |w[1] - w[2]| + |w[3] - w[4]| + \cdots + |w[k-1] - w[k]| \big) + \frac{\lambda}{2} \|w\|^2 \tag{8}
\end{align}
for even $k \le d$, and

\begin{align*}
F_{1,k}(w) &= \frac{1}{\sqrt{2}} |b - w[1]| + \frac{1}{\sqrt{2k}} \big( |w[2] - w[3]| + |w[4] - w[5]| + \cdots + |w[k-1] - w[k]| \big) + \frac{\lambda}{2} \|w\|^2, \\
F_{2,k}(w) &= \frac{1}{\sqrt{2k}} \big( |w[1] - w[2]| + |w[3] - w[4]| + \cdots + |w[k-2] - w[k-1]| \big) + \frac{\lambda}{2} \|w\|^2
\end{align*}

otherwise. Being a sum of convex functions, both local functions are convex, and in fact $\lambda$-strongly convex due to the $\frac{\lambda}{2}\|w\|^2$ term. Furthermore, both functions are $(1+\lambda)$-Lipschitz continuous over the unit Euclidean ball. To see this, let $\partial(\cdot)$ denote the subdifferential operator and note that

\begin{align*}
g \in \partial |b - w[1]| &\implies g \in \operatorname{conv}\{\sigma_0, -\sigma_0\}, \\
g \in \partial |w[l] - w[l+1]| &\implies g \in \operatorname{conv}\{\sigma_l, -\sigma_l\},
\end{align*}

where

\[
\sigma_0 = (1, 0, \ldots, 0), \qquad \sigma_l = e_l - e_{l+1}.
\]

Assume for a moment that $\lambda = 0$. Then, by the linearity of the sub-differential operator,

\[
\forall g \in \partial F_{1,k}(w): \ \|g\| \le 1, \qquad \forall g \in \partial F_{2,k}(w): \ \|g\| \le 1,
\]

which shows that, for $\lambda = 0$, both functions are 1-Lipschitz. For $\lambda > 0$, note that $\frac{\lambda}{2}\|w\|^2$ is $\lambda$-Lipschitz over the unit ball and $\lambda$-strongly convex. Therefore, using the linearity of the sub-differential operator again, we see that both $F_{i,k}$ are $(1+\lambda)$-Lipschitz and $\lambda$-strongly convex functions over the unit ball.
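The Lipschitz claim can be probed numerically. The Python sketch below implements the two local functions for even $k$, assuming the coefficients $1/\sqrt{2}$ and $1/\sqrt{2k}$ on the absolute-value terms, and measures the largest observed ratio $|F(u) - F(v)| / \|u - v\|$ over random pairs in the unit ball. It is a randomized illustration, not a proof, and the function names and parameter values are assumptions.

```python
import math
import random

def F1(w, k, b, lam):
    """F_{1,k} for even k: |b - w[1]|/sqrt(2) plus the pairs (2,3), (4,5), ..., (k-2,k-1)."""
    pairs = sum(abs(w[i] - w[i + 1]) for i in range(1, k - 2, 2))  # 0-based pairs (1,2), (3,4), ...
    return abs(b - w[0]) / math.sqrt(2) + pairs / math.sqrt(2 * k) + 0.5 * lam * sum(x * x for x in w)

def F2(w, k, lam):
    """F_{2,k} for even k: the pairs (1,2), (3,4), ..., (k-1,k)."""
    pairs = sum(abs(w[i] - w[i + 1]) for i in range(0, k - 1, 2))  # 0-based pairs (0,1), (2,3), ...
    return pairs / math.sqrt(2 * k) + 0.5 * lam * sum(x * x for x in w)

def random_point_in_unit_ball(d, rng):
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    radius = rng.random()
    return [radius * x / norm for x in v]

def max_lipschitz_ratio(f, d, rng, trials=2000):
    """Largest observed |f(u) - f(v)| / ||u - v|| over random pairs in the unit ball."""
    best = 0.0
    for _ in range(trials):
        u, v = random_point_in_unit_ball(d, rng), random_point_in_unit_ball(d, rng)
        dist = math.sqrt(sum((a - c) ** 2 for a, c in zip(u, v)))
        if dist > 0:
            best = max(best, abs(f(u) - f(v)) / dist)
    return best

rng = random.Random(1)
k = d = 8
lam, b = 0.25, 1 / math.sqrt(k)  # illustrative values only
ratio1 = max_lipschitz_ratio(lambda w: F1(w, k, b, lam), d, rng)
ratio2 = max_lipschitz_ratio(lambda w: F2(w, k, lam), d, rng)
```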

Similar to the smooth case, the following lemma shows that, no matter how the subgradients are chosen, at each iteration at most one non-zero coordinate may be gained.

Lemma 2. Suppose all the sets of feasible points satisfy $W_j \subseteq E_{T,d}$ for some $T \le d-1$. Then, under Assumption 1, right after the next communication round we have $W_j \subseteq E_{T+1,d}$.

Proof. Recall that by Assumption 1 (modified for the non-differentiable case), each machine can compute new points $w$ that satisfy the following for some $\gamma, \nu \ge 0$ such that $\gamma + \nu > 0$:

\[
\gamma w + \nu g_{i,k}(w) \in \operatorname{span}\Big\{ w',\ g_{i,k}(w'),\ (\nabla^2 F_{i,k}(w') + D) w'',\ (\nabla^2 F_{i,k}(w') + D)^{-1} w'' \ \Big|\
w', w'' \in W_j,\ g_{i,k}(w') \in \partial F_{i,k}(w'),\ D \text{ diagonal},\ \nabla^2 F_{i,k}(w') \text{ exists},\ (\nabla^2 F_{i,k}(w') + D)^{-1} \text{ exists} \Big\}.
\]

We now analyze the state of the sets of feasible points prior to the next communication round. Assume that $T$ is an odd number, i.e., assume $T = 2p+1$ for some $p \in \mathbb{N} \cup \{0\}$. We show that as long as no communication round has been executed, it must hold that $W_j \subseteq E_{T,d}$ for machines whose local function is $F_{1,k}$, and that $W_j \subseteq E_{T+1,d}$ for machines whose local function is $F_{2,k}$. The case where $T$ is even follows a similar line.
Let

\[
E_{0,d} = \{0\}, \qquad E_{T,d} = \operatorname{span}\{e_{1,d}, \ldots, e_{T,d}\},
\]

where $e_{i,d} \in \mathbb{R}^d$ denote the standard unit vectors. First, we prove the claim for machines whose local function is $F_{1,k}$. In this case, for any $w', w'' \in E_{2p+1,d}$, it holds that

\begin{align*}
g_{1,k}(w') &\in \operatorname{conv}\{\pm\sigma_{2l} \mid l = 0, \ldots, p\} \subseteq E_{2p+1,d}, \\
(\nabla^2 F_{1,k}(w') + D) w'' &= (\lambda I + D) w'' \in E_{2p+1,d}, \\
(\nabla^2 F_{1,k}(w') + D)^{-1} w'' &= (\lambda I + D)^{-1} w'' \in E_{2p+1,d},
\end{align*}

for any viable diagonal matrix $D$. Therefore, we have that the first point generated by machines which hold $F_{1,k}$ must satisfy

\[
\gamma w + \nu g_{1,k}(w) \in E_{2p+1,d} \tag{9}
\]

for $\gamma, \nu$ as stated in Assumption 1. Note that, if $\nu = 0$ (which by assumption means that $\gamma > 0$) then clearly $w \in E_{2p+1,d}$. As for $\nu \ne 0$, suppose by contradiction that $w \notin E_{2p+1,d}$. That is, assume that there exists some $j > 2p+1$ such that $w[j] \ne 0$. First, if the absolute value terms in $F_{1,k}$ do not involve $w[j]$, e.g., when $j = d$ and $d$ is even, we have $g_{1,k}(w)[j] = \lambda w[j]$. In this case, by Eq. (9) we have

\[
\gamma w[j] + \nu\lambda w[j] = (\gamma + \nu\lambda) w[j] = 0,
\]

and since $\nu\lambda > 0$, this implies that $w[j] = 0$ - a contradiction! Thus, it remains to consider the cases of either odd $d$ or $j \ne d$. In both of these cases, $w[j]$ appears in one of the absolute value terms in $F_{1,k}$, either as $|w[j-1] - w[j]|$ or $|w[j] - w[j+1]|$ (depending on whether $j$ is odd or even).

Let $l > p$ be such that either $2l = j$ or $2l+1 = j$, depending on the parity of $j$. We note that any valid subgradient must satisfy

\begin{align*}
g_{1,k}(w)[2l] &= \frac{2\alpha - 1}{\sqrt{2k}} + \lambda w[2l], \\
g_{1,k}(w)[2l+1] &= \frac{1 - 2\alpha}{\sqrt{2k}} + \lambda w[2l+1],
\end{align*}

for some $\alpha \in [0,1]$, such that if $w[2l] - w[2l+1] \ne 0$ then

\[
\operatorname{sgn}(w[2l] - w[2l+1]) = \operatorname{sgn}(2\alpha - 1), \tag{10}
\]

where $\operatorname{sgn}(\cdot)$ is the sign function. Rearranging terms in Eq. (9) and using the facts that coordinates $2l, 2l+1$ are always zero in $E_{2p+1,d}$, as well as $\gamma + \nu\lambda \ge \nu\lambda > 0$, we get

\begin{align}
w[2l] &= \frac{-\nu(2\alpha - 1)}{\sqrt{2k}\,(\gamma + \nu\lambda)}, \notag \\
w[2l+1] &= \frac{\nu(2\alpha - 1)}{\sqrt{2k}\,(\gamma + \nu\lambda)}. \tag{11}
\end{align}

Therefore,

\[
w[2l] - w[2l+1] = \frac{-2\nu(2\alpha - 1)}{\sqrt{2k}\,(\gamma + \nu\lambda)}, \tag{12}
\]

which implies

\[
\operatorname{sgn}(w[2l] - w[2l+1]) = -\operatorname{sgn}(2\alpha - 1),
\]

contradicting Eq. (10). Hence, we must have $w[2l] - w[2l+1] = 0$, in which case Eq. (12) implies $\alpha = 1/2$. Thus, by Eq. (11),

\[
w[2l] = w[2l+1] = 0,
\]

which contradicts the assumption that $w[j]$ (and hence either $w[2l]$ or $w[2l+1]$) is not zero. Thus, we have shown that $w \in E_{2p+1,d}$ for the first point generated by machines holding $F_{1,k}$. Repeating the argument, we get that any point generated by those machines, in the absence of any communication rounds, is 'stuck' in $E_{2p+1,d}$.

We now turn to prove the claim for machines whose local function is $F_{2,k}$, using an almost identical argument, which we provide below for completeness. For these functions, we assume that initially $W_j \subseteq E_{2p+1,d}$, and will show that any additional points computed locally by the machines must be in $E_{2p+2,d}$. We begin by noting that for any $w', w''$ in $E_{2p+2,d}$ (and in particular $E_{2p+1,d}$), it holds that

\begin{align*}
g_{2,k}(w') &\in \operatorname{conv}\{\pm\sigma_{2l+1} \mid l = 0, \ldots, p\} \subseteq E_{2p+2,d}, \\
(\nabla^2 F_{2,k}(w') + D) w'' &= (\lambda I + D) w'' \in E_{2p+2,d}, \\
(\nabla^2 F_{2,k}(w') + D)^{-1} w'' &= (\lambda I + D)^{-1} w'' \in E_{2p+2,d},
\end{align*}

for any viable diagonal matrix $D$. Therefore, we have that the first point generated by machines which hold $F_{2,k}$ must satisfy

\[
\gamma w + \nu g_{2,k}(w) \in E_{2p+2,d} \tag{13}
\]

for $\gamma, \nu$ as stated in the assumption. Note that, if $\nu = 0$ then clearly $w \in E_{2p+2,d}$. As for $\nu \ne 0$, suppose by contradiction that $w \notin E_{2p+2,d}$. That is, assume that there exists some $j > 2p+2$ such that $w[j] \ne 0$. First, if the absolute value terms in $F_{2,k}$ do not involve $w[j]$, e.g., when $j = d$ and $d$ is odd, we have $g_{2,k}(w)[j] = \lambda w[j]$. In this case, by Eq. (13) we have

\[
\gamma w[j] + \nu\lambda w[j] = (\gamma + \nu\lambda) w[j] = 0,
\]

and since $\nu\lambda > 0$, this implies that $w[j] = 0$ - a contradiction! Thus, it remains to consider the cases of either even $d$ or $j \ne d$. In both of these cases, $w[j]$ appears in one of the absolute value terms in $F_{2,k}$, either as $|w[j-1] - w[j]|$ or $|w[j] - w[j+1]|$ (depending on whether $j$ is odd or even).

Let $l > p$ be such that either $2l+1 = j$ or $2l+2 = j$, depending on the parity of $j$. We note that any valid subgradient must satisfy

\begin{align*}
g_{2,k}(w)[2l+1] &= \frac{2\alpha - 1}{\sqrt{2k}} + \lambda w[2l+1], \\
g_{2,k}(w)[2l+2] &= \frac{1 - 2\alpha}{\sqrt{2k}} + \lambda w[2l+2],
\end{align*}

for some $\alpha \in [0,1]$, such that if $w[2l+1] - w[2l+2] \ne 0$ then

\[
\operatorname{sgn}(w[2l+1] - w[2l+2]) = \operatorname{sgn}(2\alpha - 1). \tag{14}
\]

Rearranging terms in Eq. (13) and using the fact that $\gamma + \nu\lambda \ge \nu\lambda > 0$, we get

\begin{align}
w[2l+1] &= \frac{-\nu(2\alpha - 1)}{\sqrt{2k}\,(\gamma + \nu\lambda)}, \notag \\
w[2l+2] &= \frac{\nu(2\alpha - 1)}{\sqrt{2k}\,(\gamma + \nu\lambda)}. \tag{15}
\end{align}

Therefore,

\[
w[2l+1] - w[2l+2] = \frac{-2\nu(2\alpha - 1)}{\sqrt{2k}\,(\gamma + \nu\lambda)}, \tag{16}
\]

which implies

\[
\operatorname{sgn}(w[2l+1] - w[2l+2]) = -\operatorname{sgn}(2\alpha - 1),
\]

a contradiction to Eq. (14). Hence, we must have $w[2l+1] - w[2l+2] = 0$, in which case Eq. (16) implies $\alpha = 1/2$. Thus, by Eq. (15),

\[
w[2l+1] = w[2l+2] = 0,
\]

which contradicts our assumption that $w[j]$ (and hence either $w[2l+1]$ or $w[2l+2]$) is not zero. Thus, we have shown that $w \in E_{2p+2,d}$. As before, repeating the argument together with the assumption that $W_j \subseteq E_{2p+2,d}$ shows that, in the absence of any communication rounds, all the machines whose local function is $F_{2,k}$ are 'stuck' in $E_{2p+2,d}$. Therefore, before the next communication round, $W_j \subseteq E_{2p+2,d}$ for all machines $j$ holding $F_{2,k}$. Moreover, as shown earlier, $W_j \subseteq E_{2p+1,d}$ for all machines holding $F_{1,k}$. Therefore, after the next communication round, $W_j \subseteq E_{2p+2,d}$ for any machine $j$. □

Repeatedly applying Lemma 2, we get the following corollary:

Corollary 2. Under Assumption 1, after $T \le d-1$ communication rounds we have

\[
W_j \subseteq E_{T+1,d}.
\]

With this corollary in hand, we now turn to establish the main result, namely, bounding from below the suboptimality of points in $W_j$ after $T$ communication rounds. Choosing the dimension $d$ such that $T \le d-2$, we employ the local functions defined in Eq. (8) with $k = T+2$. In this case, the average function is

\[
F(w) = \frac{1}{2} F_{1,k}(w) + \frac{1}{2} F_{2,k}(w) = \frac{1}{2\sqrt{2}} |b - w[1]| + \frac{1}{2\sqrt{2(T+2)}} \sum_{i=1}^{T+1} |w[i] - w[i+1]| + \frac{\lambda}{2} \|w\|^2 .
\]

The key ingredient in deriving the lower bound is Corollary 2, according to which after $T$ communication rounds, all but the first $T+1$ coordinates must be zero, in particular $w[T+2] = 0$. Using this and the triangle inequality, we have

\[
\sum_{i=1}^{T+1} |w[i] - w[i+1]| \ge |w[1] - w[T+2]| = |w[1]|
\]

for all $w$ in $W_j$. Therefore, we can lower bound the objective value of the algorithm's output by

\[
\min_{w \in \mathbb{R}} \ \frac{1}{2\sqrt{2}} |b - w| + \frac{1}{2\sqrt{2(T+2)}} |w| + \frac{\lambda}{2} w^2 .
\]


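On the flip side, the minimal value of $F(w)$ over the unit Euclidean ball can be upper bounded by $F(w_b)$ for some $b \le 1/\sqrt{T+2}$, where

\[
w_b = (\underbrace{b, \ldots, b}_{T+2 \text{ times}}, 0, \ldots, 0).
\]

Putting both bounds together yields

\begin{align}
\min_{w \in W_j} F(w) - \min_{\|w\| \le 1} F(w) &\ge \min_{w \in W_j} F(w) - F(w_b) \notag \\
&\ge \min_{w \in \mathbb{R}} \left\{ \frac{1}{2\sqrt{2}} |b - w| + \frac{1}{2\sqrt{2(T+2)}} |w| + \frac{\lambda}{2} w^2 \right\} - \frac{\lambda (T+2) b^2}{2}. \tag{17}
\end{align}

Assuming $T \ge \frac{1}{2\lambda} - 2$ (so that $\lambda \ge \frac{1}{2(T+2)}$), we pick

\[
b = \frac{1}{2\lambda (T+2) \sqrt{2(T+2)}}
\]

(note again that $w_b$ is indeed in the unit ball for this regime of $\lambda$ and $T$). In this case, the minimal $w$ in Eq. (17) is $\frac{1}{2\lambda (T+2)\sqrt{2(T+2)}}$, so we get a suboptimality lower bound of

\begin{align}
&\frac{1}{2\sqrt{2(T+2)}} \cdot \frac{1}{2\lambda (T+2)\sqrt{2(T+2)}} + \frac{1}{8\lambda \big( (T+2)\sqrt{2(T+2)} \big)^2} - \frac{T+2}{8\lambda \big( (T+2)\sqrt{2(T+2)} \big)^2} \notag \\
&\qquad \ge \frac{1}{8\lambda (T+2)^2} + 0 - \frac{1}{16\lambda (T+2)^2} = \frac{1}{16\lambda (T+2)^2}. \tag{18}
\end{align}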
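The arithmetic in this step is easy to get wrong, so here is a small Python sketch that evaluates it numerically. It takes the choice $b = \frac{1}{2\lambda(T+2)\sqrt{2(T+2)}}$ and evaluates the objective at $w = b$, the point the text identifies as the minimizer; the function names and the sample $(T, \lambda)$ pairs are illustrative assumptions.

```python
import math

def suboptimality_bound(T, lam):
    """Objective at w = b minus the upper bound lam*(T+2)*b^2/2 on F(w_b)."""
    n = T + 2
    b = 1.0 / (2.0 * lam * n * math.sqrt(2.0 * n))
    value_at_b = b / (2.0 * math.sqrt(2.0 * n)) + 0.5 * lam * b * b  # the |b - w| term vanishes at w = b
    return value_at_b - 0.5 * lam * n * b * b

def rounds_needed(eps, lam):
    """Smallest real T with 1/(16*lam*(T+2)^2) <= eps."""
    return math.sqrt(1.0 / (16.0 * lam * eps)) - 2.0

cases = [(3, 0.5), (10, 0.1), (100, 0.01)]  # each satisfies lam >= 1/(2*(T+2))
values = [(suboptimality_bound(T, lam), 1.0 / (16.0 * lam * (T + 2) ** 2)) for T, lam in cases]
recovered_T = [rounds_needed(1.0 / (16.0 * lam * (T + 2) ** 2), lam) for T, lam in cases]
```

Inverting $\frac{1}{16\lambda(T+2)^2} \le \epsilon$ gives $T \ge \sqrt{1/(16\lambda\epsilon)} - 2$, which is what `rounds_needed` computes.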


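This bound holds in particular for any $T \ge \left\lceil \frac{1}{2\lambda} - 2 \right\rceil$. If the number of communication rounds $T$ is less than $\left\lceil \frac{1}{2\lambda} - 2 \right\rceil$, then clearly we cannot do better than with $\left\lceil \frac{1}{2\lambda} - 2 \right\rceil$ communication rounds. Therefore, for any number of communication rounds $T$, the suboptimality is at least

\[
\min\left\{ \frac{1}{16\lambda \left( \left\lceil \frac{1}{2\lambda} - 2 \right\rceil + 2 \right)^2},\ \frac{1}{16\lambda (T+2)^2} \right\}.
\]

Therefore, for any $\epsilon \in \left( 0, \frac{1}{16\lambda \left( \left\lceil \frac{1}{2\lambda} - 2 \right\rceil + 2 \right)^2} \right)$, we would need at least $T \ge \sqrt{\frac{1}{16\lambda\epsilon}} - 2$ communication rounds to get an $\epsilon$-suboptimal solution. This implies the theorem statement for $\lambda$-strongly convex functions.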

Finally, we treat the case where the local functions are not required to be strongly convex. In this setting, for proving a lower bound, we can use the same construction as in Eq. (8), where we are free to choose any $\lambda$. In particular, let us choose $\lambda = \frac{1}{2(T+2)}$ in the lower bound derived above (note that in this case the condition $T \ge \frac{1}{2\lambda} - 2$ trivially holds). Plugging it into Eq. (18), we establish that for any number of communication rounds $T$, the suboptimality is at least

\[
\frac{1}{8(T+2)}.
\]

Considering how large $T$ must be to make this smaller than some $\epsilon$, we get that $T$ must be at least $\frac{1}{8\epsilon} - 2$.


A.3 Proof of Thm. 3

As usual, we construct two functions $F_1, F_2$, and provide $F_1$ to $m/2$ of the machines, and $F_2$ to the other $m/2$ machines, in some arbitrary order, such that the machine designated to provide the output receives $F_2$. Note that the average function $F$ is simply $\frac{1}{2}(F_1(w) + F_2(w))$.

Let $c$ be a certain positive numerical constant (whose value corresponds to $c$ in Lemma 6 below). Given some symmetric $M \in \{-1,+1\}^{d \times d}$, where $\|M\| \le c\sqrt{d}$, and $j \in \{\lceil d/2 \rceil, \ldots, d\}$, define

\[
F_1(w) = 3\lambda\, w^\top \left( \left( I + \frac{1}{2c\sqrt{d}} M \right)^{-1} - \frac{1}{2} I \right) w,
\qquad
F_2(w) = \frac{3\lambda}{2} \|w\|^2 - \delta e_j^\top w .
\]

The average $F$ of $F_1, F_2$ equals

\[
F(w) = \frac{1}{2}(F_1(w) + F_2(w)) = \frac{3\lambda}{2} w^\top \left( I + \frac{1}{2c\sqrt{d}} M \right)^{-1} w - \frac{\delta}{2} e_j^\top w,
\]

with an optimum at

\[
w^* = \frac{\delta}{6\lambda} \left( I + \frac{1}{2c\sqrt{d}} M \right) e_j .
\]

The following lemma establishes that the functions satisfy the strong convexity, smoothness and relatedness requirements of the theorem. The proof also establishes that the inverse in the definition of $F_1$ indeed exists.


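Lemma 3. $F_1$ and $F_2$ are $\lambda$-strongly convex, $9\lambda$-smooth, and $\delta$-related.

Proof. The Hessian of $F_2$ is $3\lambda I$, which implies that $F_2$ is $3\lambda$-smooth and strongly convex (and in particular, $\lambda$-strongly convex). As to $F_1$, note that since $\|M\| \le c\sqrt{d}$, then

\[
\left\| \frac{1}{2c\sqrt{d}} M \right\| \le \frac{1}{2}.
\]

The fact that the spectral radius and spectral norm of symmetric matrices coincide implies that the eigenvalues of the matrix $I + \frac{1}{2c\sqrt{d}} M$ lie between $1 - \frac{1}{2} = \frac{1}{2}$ and $1 + \frac{1}{2} = \frac{3}{2}$. Thus, all the eigenvalues are strictly positive, hence the matrix is indeed invertible as in the definition of $F_1$. Moreover, the eigenvalues of the inverse $\left( I + \frac{1}{2c\sqrt{d}} M \right)^{-1}$ lie in $\left[ \frac{1}{3/2}, \frac{1}{1/2} \right] = \left[ \frac{2}{3}, 2 \right]$, and therefore those of $3\lambda \left( \left( I + \frac{1}{2c\sqrt{d}} M \right)^{-1} - \frac{1}{2} I \right)$ lie in $\left[ 3\lambda \left( \frac{2}{3} - \frac{1}{2} \right), 3\lambda \left( 2 - \frac{1}{2} \right) \right] = \left[ \frac{\lambda}{2}, \frac{9\lambda}{2} \right]$. Thus, the spectrum of the Hessian of $F_1$ lies in $[\lambda, 9\lambda]$, which implies that $F_1$ is $\lambda$-strongly convex and $9\lambda$-smooth.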

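To show $\delta$-relatedness, the only non-trivial part is upper-bounding the norm of the difference of the quadratic terms, which equals the following:

\[
\left\| 3\lambda \left( \left( I + \frac{1}{2c\sqrt{d}} M \right)^{-1} - \frac{1}{2} I \right) - \frac{3\lambda}{2} I \right\|
= 3\lambda \left\| \left( I + \frac{1}{2c\sqrt{d}} M \right)^{-1} - I \right\| . \tag{19}
\]

Since $\|M\| \le c\sqrt{d}$, the eigenvalues of $\left( I + \frac{1}{2c\sqrt{d}} M \right)^{-1} - I$ lie between $\frac{2}{3} - 1 = -\frac{1}{3}$ and $2 - 1 = 1$, which implies that Eq. (19) can be upper bounded by $3\lambda \le \delta$. □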


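The next lemma proves the second part of the theorem, namely an upper bound on the suboptimality of any local function optimizer.

Lemma 4. For any $\hat{w}_j = \arg\min_{w \in \mathbb{R}^d} F_j(w)$, it holds that $F(\hat{w}_j) - \min_{w \in \mathbb{R}^d} F(w) \le c\,\delta^2/\lambda$ for some numerical positive constant $c$.

Proof. The optimum of any quadratic and strongly convex function $w^\top A w + b^\top w + c$ equals $-\frac{1}{2} A^{-1} b$. Therefore, if $w^*$ is the optimizer of $F$, and we denote the parameters of $F$ and $F_j$ by $A, b, c$ and $A_j, b_j, c_j$ respectively, then

\begin{align*}
\|\hat{w}_j - w^*\| &= \frac{1}{2} \left\| A_j^{-1} b_j - A^{-1} b \right\|
= \frac{1}{2} \left\| A_j^{-1} b_j - A^{-1} b_j + A^{-1} b_j - A^{-1} b \right\| \\
&\le \frac{1}{2} \left( \left\| A_j^{-1} - A^{-1} \right\| \|b_j\| + \left\| A^{-1} \right\| \|b_j - b\| \right).
\end{align*}

By definition of $F_1, F_2$ and the average function $F$, this is at most

\[
\frac{1}{2} \left( \left\| A_j^{-1} - A^{-1} \right\| \delta + \left\| A^{-1} \right\| \frac{\delta}{2} \right). \tag{20}
\]

In Lemma 3, we showed that $F_1, F_2$ are $\lambda$-strongly convex and $9\lambda$-smooth, which implies that the eigenvalues of $A_j$ as well as $A$ lie in $\left[ \frac{\lambda}{2}, \frac{9\lambda}{2} \right]$. Therefore, the eigenvalues of $A_j^{-1}$ and $A^{-1}$ lie in $\left[ \frac{2}{9\lambda}, \frac{2}{\lambda} \right]$, so $\left\| A_j^{-1} - A^{-1} \right\| \le \frac{2}{\lambda}$ and $\left\| A^{-1} \right\| \le \frac{2}{\lambda}$. Substituting this back into Eq. (20), we get

\[
\|\hat{w}_j - w^*\| \le \frac{3\delta}{2\lambda}.
\]

Finally, since $F$ is $9\lambda$-smooth, and its minimizer is $w^*$,

\[
F(\hat{w}_j) - F(w^*) \le \frac{9\lambda}{2} \|\hat{w}_j - w^*\|^2 \le \frac{9\lambda}{2} \left( \frac{3\delta}{2\lambda} \right)^2,
\]

which equals $81\delta^2 / 8\lambda$ as required. □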
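The constants in this chain can be verified with exact rational arithmetic. The Python sketch below assumes the bounds used in the proof ($\|A_j^{-1} - A^{-1}\| \le 2/\lambda$, $\|A^{-1}\| \le 2/\lambda$, $\|b_j\| \le \delta$ and $\|b_j - b\| \le \delta/2$) and confirms that they combine to $3\delta/(2\lambda)$ and $81\delta^2/(8\lambda)$. The function name and sample values are hypothetical.

```python
from fractions import Fraction

def local_optimizer_bound(delta, lam):
    """Evaluate the chain of constants in the proof of Lemma 4 exactly."""
    inv_norm = 2 / lam   # bound on ||A_j^{-1} - A^{-1}|| and on ||A^{-1}||
    distance = Fraction(1, 2) * (inv_norm * delta + inv_norm * delta / 2)
    gap = Fraction(9, 2) * lam * distance ** 2   # uses 9*lam-smoothness of F
    return distance, gap

delta, lam = Fraction(3), Fraction(1)   # illustrative values with delta >= 3*lam > 0
distance, gap = local_optimizer_bound(delta, lam)
```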
We now turn to derive the lower bound in the theorem statement. As discussed earlier, the intuition is that the optimal point $w^*$ is a function of the $j$-th column of $M$, so the machines holding $F_1$ must broadcast enough information on $M$ to the designated machine producing the algorithm's output (this machine, by construction, holds $F_2$, and hence knows $j$ but not $M$). As long as the communication budget is smaller than the size of $M$, this will be difficult to achieve. This intuition is formalized in the following lemma, which is based on information-theoretic tools:

Lemma 5. For any dimension $d > c$ (where $c$ is the same constant as in Lemma 6 and the definition of $F_1$), and for any (possibly randomized) 1-round algorithm using at most $d^2/128$ bits of communication, there exists a valid choice of $M, j$ for the functions $F_1, F_2$ defined above, such that the vector $\hat{w}$ returned by the algorithm satisfies

\[
\mathbb{E}\left[ \|\hat{w} - w^*\|^2 \right] \ge c' \frac{\delta^2}{\lambda^2},
\]

where the expectation is over the algorithm's randomness, and $c'$ is a positive numerical constant.

Using the lemma and the $\lambda$-strong convexity of $F_1, F_2$ (and hence their average $F$),

\[
\mathbb{E}[F(\hat{w}) - F(w^*)] \ge \frac{\lambda}{2} \mathbb{E}\left[ \|\hat{w} - w^*\|^2 \right] \ge \frac{c' \delta^2}{2\lambda},
\]

hence proving the theorem.

It now remains to prove Lemma 5.

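Proof of Lemma 5. By definition of $w^*$, we have that the $j$-th column of $M$, designated as $M_j$, satisfies

\[
M_j = \frac{12 c \lambda \sqrt{d}}{\delta} w^* - 2c\sqrt{d}\, e_j .
\]

Given the predictor $\hat{w}$ returned by the algorithm, define

\[
\hat{M}_j = \frac{12 c \lambda \sqrt{d}}{\delta} \hat{w} - 2c\sqrt{d}\, e_j .
\]

This can be thought of as the algorithm's 'estimate' of the $j$-th column of $M$, based on the returned predictor.

Define $[w] = \min\{1, \max\{-1, w\}\}$ as the clipping operation of a scalar $w$ to $[-1,+1]$, and for a vector $w = (w_1, \ldots, w_d)$, define $[w] = ([w_1], [w_2], \ldots, [w_d])$. By the expressions for $M_j, \hat{M}_j$ above, and since the entries of $M_j$ lie in $[-1,+1]$, we have

\[
\left\| [\hat{M}_j] - M_j \right\|^2 \le \left\| \hat{M}_j - M_j \right\|^2 = \left\| \frac{12 c \lambda \sqrt{d}}{\delta} (\hat{w} - w^*) \right\|^2 = \left( \frac{12 c \lambda \sqrt{d}}{\delta} \right)^2 \sum_{i=1}^{d} (\hat{w}_i - w_i^*)^2,
\]

which implies that

\[
\|\hat{w} - w^*\|^2 \ge \left( \frac{\delta}{12 c \lambda \sqrt{d}} \right)^2 \left\| [\hat{M}_j] - M_j \right\|^2 . \tag{21}
\]

To get the lemma statement, it is enough to show that for some $M, j$, one can lower bound $\mathbb{E}\left[ \| [\hat{M}_j] - M_j \|^2 \right]$ (where the expectation is over the algorithm's randomness) by some constant multiple of $d$.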
Below, we will prove that if $M$ (in the definition of $F_1$) is chosen uniformly at random from all $\{-1,+1\}$-valued $d \times d$ symmetric matrices, and $j$ (in the definition of $F_2$) is chosen uniformly at random from $\{\lceil d/2 \rceil, \ldots, d\}$, then for any deterministic algorithm,

\[
\mathbb{E}_{M,j}\left[ \left\| [\hat{M}_j] - M_j \right\|^2 \right] \ge \frac{d}{8}. \tag{22}
\]

Let us first show how this can be used to prove the lemma. To do so, we will need the following lemma on the concentration of the spectral norm of random symmetric matrices.

Lemma 6 ([21], Corollary 2.3.6). There exist positive numerical constants $c, c'$, such that if $M$ is a $d \times d$ symmetric matrix, where each entry $M_{i,j}$ with $j \ge i$ is chosen independently and uniformly from $\{-1,+1\}$, and $d > c$, then $\Pr(\|M\| > c\sqrt{d}) \le c \exp(-c' d)$.

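First, we note that the expectation in Eq. (22) is over all symmetric $\{-1,+1\}$-valued matrices, including those whose spectral norm may be larger than $c\sqrt{d}$. However, by Lemma 6, $\Pr(\|M\| > c\sqrt{d}) \le c \exp(-c' d)$ for some absolute constant $c'$. Letting $\mathcal{E}$ be the event that $\|M\| > c\sqrt{d}$, and noting that $\|[w]\|^2 \le d$ for any vector $w$, we have

\begin{align*}
\mathbb{E}\left[ \left\| [\hat{M}_j] - M_j \right\|^2 \right]
&= \mathbb{E}\left[ \left\| [\hat{M}_j] - M_j \right\|^2 \,\middle|\, \mathcal{E} \right] \Pr(\mathcal{E}) + \mathbb{E}\left[ \left\| [\hat{M}_j] - M_j \right\|^2 \,\middle|\, \neg\mathcal{E} \right] \Pr(\neg\mathcal{E}) \\
&\le d \Pr(\mathcal{E}) + \mathbb{E}\left[ \left\| [\hat{M}_j] - M_j \right\|^2 \,\middle|\, \neg\mathcal{E} \right] \\
&\le c\, d \exp(-c' d) + \mathbb{E}\left[ \left\| [\hat{M}_j] - M_j \right\|^2 \,\middle|\, \neg\mathcal{E} \right].
\end{align*}

Plugging back into Eq. (22), we get that

\[
\mathbb{E}\left[ \left\| [\hat{M}_j] - M_j \right\|^2 \,\middle|\, \neg\mathcal{E} \right] \ge \frac{d}{8} - c\, d \exp(-c' d),
\]

which is at least $d/16$ for any $d$ larger than some constant. Combining with Eq. (21), we get

\[
\mathbb{E}\left[ \|\hat{w} - w^*\|^2 \,\middle|\, \neg\mathcal{E} \right] \ge \left( \frac{\delta}{12 c \lambda \sqrt{d}} \right)^2 \frac{d}{16} = \frac{\delta^2}{2304\, c^2 \lambda^2}.
\]

This inequality implies that for any deterministic algorithm, in expectation over the random draw of $j$ and a $\{-1,+1\}$-valued matrix $M$ with spectral norm at most $c\sqrt{d}$, $\|\hat{w} - w^*\|^2$ will be at least $c' \delta^2/\lambda^2$ for some suitable constant $c'$. By Yao's minimax principle, this implies that for any (possibly randomized) algorithm, there will be some deterministic choice of $M, j$ such that $\|M\| \le c\sqrt{d}$, and for which

\[
\mathbb{E}\left[ \|\hat{w} - w^*\|^2 \right] \ge c' \frac{\delta^2}{\lambda^2}
\]

(in expectation over the algorithm's randomness), yielding the lemma's statement.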

It now remains to prove Eq. (22), assuming $j$ is chosen uniformly at random from $\{\lceil d/2 \rceil, \ldots, d\}$, and $M$ is chosen at random (i.e. each entry at or above the main diagonal is chosen independently and uniformly from $\{-1,+1\}$). Roughly speaking, the proof idea is to reduce this to an upper bound on how much information the machines holding $M$ can send on $M$'s entries (and more particularly, on the entries in the upper-right quadrant of $M$). Since this quadrant is composed of $\Theta(d^2)$ random variables, and the machines can send much less than $d^2$ bits, this information is necessarily restricted.

Let $\Pr(\cdot)$ denote probability with respect to the random choice of $M, j$, and let $\Pr_j(\cdot)$ denote probability conditioned on the choice of $j$. Recalling that any entry $M_{j,i}$ in the $j$-th column has values in $\{-1,+1\}$, it follows that either $\hat{M}_{j,i}$ has the same sign as $M_{j,i}$, or that $([\hat{M}_{j,i}] - M_{j,i})^2$ is at least 1. Therefore, we have the following:


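\begin{align}
\mathbb{E}\left[ \left\| [\hat{M}_j] - M_j \right\|^2 \right]
&\ge \mathbb{E}\left[ \sum_{i=1}^{\lceil d/2 \rceil} \left( [\hat{M}_{j,i}] - M_{j,i} \right)^2 \right] \notag \\
&\ge \sum_{i=1}^{\lceil d/2 \rceil} \mathbb{E}\left[ \left( [\hat{M}_{j,i}] - M_{j,i} \right)^2 \,\middle|\, \hat{M}_{j,i} M_{j,i} \le 0 \right] \Pr\left( \hat{M}_{j,i} M_{j,i} \le 0 \right) + 0 \notag \\
&\ge \sum_{i=1}^{\lceil d/2 \rceil} \Pr\left( \hat{M}_{j,i} M_{j,i} \le 0 \right)
= \frac{1}{1 + \lfloor d/2 \rfloor} \sum_{i=1}^{\lceil d/2 \rceil} \sum_{j=\lceil d/2 \rceil}^{d} \Pr_j\left( \hat{M}_{j,i} M_{j,i} \le 0 \right) \notag \\
&\ge \frac{1}{1 + \lfloor d/2 \rfloor} \sum_{i=1}^{\lceil d/2 \rceil} \sum_{j=\lceil d/2 \rceil}^{d} \left( \frac{1}{2} \Pr_j\left( \hat{M}_{j,i} > 0 \mid M_{j,i} < 0 \right) + \frac{1}{2} \Pr_j\left( \hat{M}_{j,i} \le 0 \mid M_{j,i} > 0 \right) \right) \notag \\
&= \frac{1/2}{1 + \lfloor d/2 \rfloor} \sum_{i=1}^{\lceil d/2 \rceil} \sum_{j=\lceil d/2 \rceil}^{d} \left( 1 + \Pr_j\left( \hat{M}_{j,i} > 0 \mid M_{j,i} < 0 \right) - \Pr_j\left( \hat{M}_{j,i} > 0 \mid M_{j,i} > 0 \right) \right) \notag \\
&\ge \frac{\lceil d/2 \rceil}{2} - \frac{1/2}{1 + \lfloor d/2 \rfloor} \sum_{i=1}^{\lceil d/2 \rceil} \sum_{j=\lceil d/2 \rceil}^{d} \left| \Pr_j\left( \hat{M}_{j,i} > 0 \mid M_{j,i} < 0 \right) - \Pr_j\left( \hat{M}_{j,i} > 0 \mid M_{j,i} > 0 \right) \right| . \tag{23}
\end{align}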


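Let $S$ be the vector of bits broadcasted by the machines holding $F_1$, and received by the machine designated with providing the output (recalling that it only holds $F_2$). Note that conditioned on $S$ and $j$, the algorithm's output (and hence $\hat{M}_{j,i}$) is independent of $M$. Therefore, we have

\begin{align*}
&\left| \Pr_j\left( \hat{M}_{j,i} > 0 \mid M_{j,i} < 0 \right) - \Pr_j\left( \hat{M}_{j,i} > 0 \mid M_{j,i} > 0 \right) \right| \\
&\quad = \left| \sum_S \Pr_j\left( \hat{M}_{j,i} > 0 \mid S, M_{j,i} < 0 \right) \Pr_j(S \mid M_{j,i} < 0) - \sum_S \Pr_j\left( \hat{M}_{j,i} > 0 \mid S, M_{j,i} > 0 \right) \Pr_j(S \mid M_{j,i} > 0) \right| \\
&\quad = \left| \sum_S \Pr_j\left( \hat{M}_{j,i} > 0 \mid S \right) \Pr_j(S \mid M_{j,i} < 0) - \sum_S \Pr_j\left( \hat{M}_{j,i} > 0 \mid S \right) \Pr_j(S \mid M_{j,i} > 0) \right| \\
&\quad \le \sum_S \Pr_j\left( \hat{M}_{j,i} > 0 \mid S \right) \left| \Pr_j(S \mid M_{j,i} < 0) - \Pr_j(S \mid M_{j,i} > 0) \right| \\
&\quad \le \sum_S \left| \Pr_j(S \mid M_{j,i} < 0) - \Pr_j(S \mid M_{j,i} > 0) \right| \\
&\quad \le \sum_S \left| \Pr_j(S \mid M_{j,i} < 0) - \Pr_j(S) \right| + \sum_S \left| \Pr_j(S \mid M_{j,i} > 0) - \Pr_j(S) \right| .
\end{align*}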
s s 
Since $S$ is sent by the machines holding $F_1$ (and not $F_2$), it is independent of $j$. Therefore, we can write the above as

\[
\sum_S \left| \Pr(S \mid M_{j,i} < 0) - \Pr(S) \right| + \sum_S \left| \Pr(S \mid M_{j,i} > 0) - \Pr(S) \right|,
\]

where $j$ in the conditioning is a fixed index. Using Pinsker's inequality, we can upper bound the above by

\[
\sqrt{2 D_{kl}\big( p(S \mid M_{j,i} < 0) \,\|\, p(S) \big)} + \sqrt{2 D_{kl}\big( p(S \mid M_{j,i} > 0) \,\|\, p(S) \big)},
\]

where $p$ is the probability distribution of $S$, and $D_{kl}$ is the Kullback-Leibler divergence. By the elementary inequality $\sqrt{a} + \sqrt{b} \le \sqrt{2(a+b)}$ for all non-negative $a, b$, we can upper bound the above by

\begin{align*}
&\sqrt{4 \Big( D_{kl}\big( p(S \mid M_{j,i} < 0) \,\|\, p(S) \big) + D_{kl}\big( p(S \mid M_{j,i} > 0) \,\|\, p(S) \big) \Big)} \\
&\quad = \sqrt{8 \Big( \tfrac{1}{2} D_{kl}\big( p(S \mid M_{j,i} < 0) \,\|\, p(S) \big) + \tfrac{1}{2} D_{kl}\big( p(S \mid M_{j,i} > 0) \,\|\, p(S) \big) \Big)} .
\end{align*}

Using the fact that $M_{j,i}$ (for some fixed $j, i$) is uniformly distributed in $\{-1,+1\}$, and that the mutual information $I(X;Y)$ between random variables $X, Y$ equals $\mathbb{E}_y\left[ D_{kl}\big( p(X \mid Y = y) \,\|\, p(X) \big) \right]$, the above equals

\[
\sqrt{8\, I(S; M_{j,i})} .
\]
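The chain from the $\ell_1$ gap between the two conditional distributions of $S$ to $\sqrt{8\, I(S; M_{j,i})}$ can be checked on a toy channel. In the Python sketch below, $M_{j,i}$ is uniform on $\{-1,+1\}$ and $S$ is a single bit equal to $M_{j,i}$ flipped with some probability; this channel, the function names and the flip probabilities are all assumptions made for illustration, and logarithms are natural.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence (in nats) between two finite distributions."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def check(flip_prob):
    """Return the L1 gap between the conditionals of S and the bound sqrt(8 * I(S; M))."""
    s_given_neg = [1.0 - flip_prob, flip_prob]   # distribution of S given M = -1
    s_given_pos = [flip_prob, 1.0 - flip_prob]   # distribution of S given M = +1
    s_marginal = [0.5 * a + 0.5 * b for a, b in zip(s_given_neg, s_given_pos)]
    l1_gap = sum(abs(a - b) for a, b in zip(s_given_neg, s_given_pos))
    mutual_info = 0.5 * kl(s_given_neg, s_marginal) + 0.5 * kl(s_given_pos, s_marginal)
    return l1_gap, math.sqrt(8.0 * mutual_info)

results = {q: check(q) for q in (0.0, 0.1, 0.25, 0.4, 0.5)}
```

In this toy channel the bound is close to tight for flip probabilities near $1/2$, the regime where $S$ carries little information about $M_{j,i}$.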


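Recalling that this is an upper bound on $\left| \Pr_j\left( \hat{M}_{j,i} > 0 \mid M_{j,i} < 0 \right) - \Pr_j\left( \hat{M}_{j,i} > 0 \mid M_{j,i} > 0 \right) \right|$, we have

\begin{align}
&\frac{1/2}{1 + \lfloor d/2 \rfloor} \sum_{i=1}^{\lceil d/2 \rceil} \sum_{j=\lceil d/2 \rceil}^{d} \left| \Pr_j\left( \hat{M}_{j,i} > 0 \mid M_{j,i} < 0 \right) - \Pr_j\left( \hat{M}_{j,i} > 0 \mid M_{j,i} > 0 \right) \right| \notag \\
&\quad \le \frac{\sqrt{2}}{1 + \lfloor d/2 \rfloor} \sum_{i=1}^{\lceil d/2 \rceil} \sum_{j=\lceil d/2 \rceil}^{d} \sqrt{I(S; M_{j,i})}
= \sqrt{2} \lceil d/2 \rceil \cdot \frac{1}{\lceil d/2 \rceil (1 + \lfloor d/2 \rfloor)} \sum_{i=1}^{\lceil d/2 \rceil} \sum_{j=\lceil d/2 \rceil}^{d} \sqrt{I(S; M_{j,i})} \notag \\
&\quad \le \sqrt{2} \lceil d/2 \rceil \sqrt{ \frac{1}{\lceil d/2 \rceil (1 + \lfloor d/2 \rfloor)} \sum_{i=1}^{\lceil d/2 \rceil} \sum_{j=\lceil d/2 \rceil}^{d} I(S; M_{j,i}) }, \tag{24}
\end{align}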


where the last step is by Jensen's inequality (i.e. the average of square roots is upper bounded by the square root of the average). The expression in the square root equals the average mutual information between a random variable $S$ (composed of at most $d^2/128$ bits), and $\lceil d/2 \rceil (1 + \lfloor d/2 \rfloor)$ binary random variables $M_{j,i}$, where $i \in \{1, \ldots, \lceil d/2 \rceil\}$, $j \in \{\lceil d/2 \rceil, \ldots, d\}$, which are all independent by construction. By Lemma 6 in
ifTSll . it is at most (d^/128)/ ([d/2] (1 -|- [d/2j)) < 1/32, so we have 


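\[
\sqrt{2} \lceil d/2 \rceil \sqrt{ \frac{1}{\lceil d/2 \rceil (1 + \lfloor d/2 \rfloor)} \sum_{i=1}^{\lceil d/2 \rceil} \sum_{j=\lceil d/2 \rceil}^{d} I(S; M_{j,i}) } \le \sqrt{2} \lceil d/2 \rceil \sqrt{\frac{1}{32}} = \frac{\lceil d/2 \rceil}{4}.
\]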

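Recalling this is an upper bound on Eq. (24), which is the second term in Eq. (23), we get that

\[
\mathbb{E}_{M,j}\left[ \left\| [\hat{M}_j] - M_j \right\|^2 \right] \ge \frac{\lceil d/2 \rceil}{2} - \frac{\lceil d/2 \rceil}{4} = \frac{\lceil d/2 \rceil}{4} \ge \frac{d}{8},
\]

hence justifying Eq. (22). □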
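The counting behind the last two displays can be checked exactly. The Python sketch below assumes the quantities as they appear in the argument: a budget of $d^2/128$ bits spread over $\lceil d/2 \rceil (1 + \lfloor d/2 \rfloor)$ independent entries, and the final bound $\lceil d/2 \rceil/2 - \lceil d/2 \rceil/4$. It verifies the inequalities for a range of dimensions; the function name and the range of `d` are illustrative.

```python
from fractions import Fraction

def counting_check(d):
    """Exact arithmetic for the per-entry information bound and the final bound."""
    ceil_half, floor_half = (d + 1) // 2, d // 2
    num_entries = ceil_half * (1 + floor_half)            # number of pairs (i, j) in the quadrant
    avg_info_bound = Fraction(d * d, 128) / num_entries   # at most d^2/128 bits over all entries
    final_bound = Fraction(ceil_half, 2) - Fraction(ceil_half, 4)
    return avg_info_bound, final_bound

rows = {d: counting_check(d) for d in range(2, 200)}
```

The identity $\sqrt{2} \cdot \sqrt{1/32} = 1/4$ used in the last step corresponds to $2 \cdot \frac{1}{32} = \left(\frac{1}{4}\right)^2$.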